ABSTRACT 

Seven papers commissioned by the National Institute 
of Education to clarify the state of recent knowledge about 
the effects of school desegregation on the academic achievement of 
black students are contained in this report. The papers, which 
analyze 19 "core" empirical studies on this topic, include: (1) "What 
Have Black Children Gained Academically from School Integration? 
Examination of the Meta-Analytic Evidence," by Thomas D. Cook; (2) 
"The Evidence on Desegregation and Black Achievement," by David J. 
Armor; (3) "Is Nineteen Really Better Than Ninety-Three?" by Robert 
L. Crain; (4) "School Desegregation as a Social Reform: A 
Meta-Analysis of Its Effects on Black Academic Achievement," by 
Norman Miller and Michael Carlson; (5) "Blacks and 'Brown': The 
Effects of School Desegregation on Black Students," by Walter G. 
Stephan; (6) "Desegregation and Education Productivity," by Herbert 
J. Walberg; and (7) "School Desegregation and Black Achievement: An 
Integrative View," by Paul M. Wortman. The 19 core studies examined 
in these papers were selected, based on their content and quality, 
from 157 works that looked at black students' academic achievement in 
desegregated schools. Authors of the selected works are Lewis V. 
Anderson, Jerome Baker, Orrin H. Bowman, Patricia M. Carrigan, El 
Nadel Clark, Charles L. Evans, E. F. Iwanicki and R. K. Gable, Robert 
Stanley Klein, M. A. Laird and G. Weeks, George J. Rentsch, L. W. 
Savage, Daniel S. Sheehan, Irene W. Slone, Lee Rand Smith, the 
Syracuse City School District, E. W. Thompson and U. Smidchens, D. W. 
Van Every, Herbert J. Walberg, and Stanley M. Zdep. (GC) 
INTRODUCTION 



The National Institute of Education has undertaken the most comprehensive 
and rigorous analysis to date of the effect of desegregation on Black 
student academic achievement. NIE commissioned papers from seven eminent 
scholars to clarify the state of research knowledge about the effects of 
school desegregation on the academic achievement of Black students. The 
seven scholars are Thomas Cook of Northwestern University, David Armor of 
David Armor Associates, Robert Crain of the Rand Corporation, Norman Miller 
of the University of Southern California, Walter Stephan of New Mexico 
State University, Herbert Walberg of the University of Illinois-Chicago 
Circle, and Paul Wortman of the University of Michigan. 

They were selected for their past extensive work on desegregation research, 
prominence in the field, knowledge of research methodology, and divergent 
viewpoints about the effects of desegregation on Black student academic 
achievement. NIE's intention was to find out whether, under similar 
conditions, with the same set of data and common ground rules, similarities 
and differences in analyses could be identified and clarified. 

The seven scholars met first to discuss the state of research literature 
and to agree on a comprehensive list of criteria to be used in selecting 
the studies to be analyzed. A total of 157 empirical studies were 
identified that looked at Black students' academic achievement in 
desegregated schools. A comprehensive and rigorous list of criteria 
(listed below) was adopted and applied to the total set. This process 
resulted in a "core" of 19 highest quality studies (listed below) on this 
research topic, which the scholars then statistically analyzed to reach 
their individual conclusions. This analytical effort is a significant 
improvement over previous attempts at reconciling the controversial 
literature on this topic, and it is hoped that this effort by NIE will 
prove helpful to all parties concerned with the nationally important 
subject of school desegregation. 



CRITERIA FOR REJECTION OF A STUDY 

Type of Study 

a) non-empirical 
b) summary report 

Location 

a) outside USA 
b) geographically non-specific 

Comparisons 

a) not a study of achievement of desegregated Blacks 
   (except in cases where we use a White comparison) 
b) multi-ethnic combined 
c) comparisons across ethnics only 
d) heterogeneous proportions minority in desegregated 
   condition 
e) no control data 
f) no pre-desegregation data 
g) control measures not contemporaneous 
h) excessive attrition (reviewer must provide specific 
   justification for the inclusion of studies with 
   excessive attrition, but amount was not specified) 
i) majority Black in a segregated condition (unless 
   the reviewer provides specific justification) 
j) varied exposure to desegregation (unless the reviewer 
   provides a specific justification demonstrating 
   that the variation in exposure time is not meaningful) 
k) groups are initially non-comparable (unless the reviewer 
   provides a specific justification that the amount of 
   divergence is not meaningful) 

Study Desegregation 

a) cross-sectional survey 
b) sampling procedure unknown 
c) separate non-comparable samples at each observation 

Measures 

a) unreliable and/or unstandardized instruments 
b) test content and/or instrument unknown 
c) dates of administration unknown 
d) different tests used in pretests and posttests 
e) test of IQ or verbal ability 

Data Analysis 

a) no pretest means 
b) no posttest means, unless the author reported pretest 
   scores and gains 
c) no data presented 
d) N's not discernible 
What Have Black Children Gained Academically 
From School Integration?: 
Examination of the Meta-Analytic Evidence 

Thomas D. Cook 
Northwestern University 



INTRODUCTION 

My assignment is to comment on the following essays by Armor, Crain, 
Miller, Stephan, Walberg and Wortman in order to help readers decide 
what should be concluded from their evaluations of how school 
desegregation has affected the academic achievement of black children. 
All but two of the essays contain a meta-analysis by the author. 
Crain's paper is one of the exceptions. Instead of conducting a 
meta-analysis, he critically discusses some of the assumptions behind 
the others' efforts and concludes that he will stand by the results of 
his own prior meta-analytic work (Crain & Mahard, 1983). I shall refer 
to his prior meta-analysis based on 93 studies more than to his essay in 
this volume. Walberg is the other exception. He devotes most of his 
essay to a review of factors other than desegregation that raise 
academic achievement. He does this to make the point that, if the 
purpose of desegregation is to raise the achievement of black children, 
then more effective means exist to do this than desegregation. Walberg 
does, however, reanalyze three prior meta-analyses, by Krol (1975), 
Crain & Mahard (1982), and Wortman, King, and Bryant (1982), in order to 
make the further point that, in his estimation, the average effect sizes 
they present do not reliably differ from zero. I intend to deal with 
his statistical analysis to a small extent, but will not deal directly 
with his larger point about relative efficacy. 

The first part of the present paper deals with the meta-analytic work of 
Armor, Miller, Stephan and Wortman, and is largely restricted to the 19 
studies selected by the panel. The purpose is to arrive at an estimate 
for this sample of how desegregation has affected the achievement of 
black children. I try to restrict my commentary to the most important 
points and assumptions made by the authors, and make no attempt at a 
comprehensive analysis of the strengths and weaknesses of any single 
person's work. This is to keep the focus on the desegregation issue. 
In the second part of the paper, I take my own results, which are both 
similar to and different from those of the panel, and discuss several 
ways they can be interpreted. In particular, I ask how generalizable 
the results from the panel's 19 studies are when they are compared to 
the results from larger data bases; I probe the extent to which the 
findings speak to the information needs of groups with different stakes 
in school desegregation; and I speculate about whose interests the 
panel's results might advance or prejudice. 



1. The Studies Examined. Individual panel members considered different 
subsets of the 19 studies that most of them deemed methodologically 
adequate. Armor dropped the study by Rentsch on grounds, first, that 
the desegregated group and the segregated controls differed by so much 
initially; second, that the pretests and posttests involved different 
measures; and third, that the desegregated control group contained some 
white children. He also dropped the study by Thompson & Smidchens on 
grounds that the segregated controls were in classes made up of only 42% 
minority students. However, he included the study by Carrigan, even 
though its segregated control group members were in classes that were 
hardly more "segregated," at 50% minority. Indeed, Miller and Stephan 
dropped the Carrigan study because of its questionable control 
group. In a few other cases, Armor selected control groups within 
a study that differed from the choice of all other panelists. The 
net result of Armor's preferences was lower effect sizes since (1) 
Rentsch obtained some of the largest effect sizes; (2) Carrigan 
resulted in both positive and negative effect sizes; and (3) both 
Rentsch and Carrigan involved multiple comparisons, so their results 
were disproportionately weighted whenever comparisons were the unit of 
analysis rather than individual studies. 

Miller dropped both Carrigan and Thompson & Smidchens from his 
analyses because the segregated controls were not segregated. He also 
differed from the other analysts in preferring to compute an effect size 
per study instead of per comparison. Much has been written in the 
meta-analysis literature on this topic, and our preference is to compute 
or report effect sizes each way. However, if only one choice is 
available, we favor a sample of studies because this does not weight the 
results in favor of school districts where desegregation was tested 
using several grades. 
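
The weighting issue can be illustrated with a minimal Python sketch. The district names and effect sizes are hypothetical and are not drawn from any of the 19 studies; the sketch only shows how a study that tests several grades pulls a per-comparison mean toward its own results.

```python
from statistics import mean

# Hypothetical effect sizes, keyed by study; one entry per grade-level comparison.
effect_sizes = {
    "District A": [0.40, 0.50, 0.60],  # desegregation tested in three grades
    "District B": [0.00],
    "District C": [-0.10],
}

def mean_per_comparison(data):
    """Average over every comparison; multi-grade studies count several times."""
    return mean(es for values in data.values() for es in values)

def mean_per_study(data):
    """Average within each study first, then average the study means."""
    return mean(mean(values) for values in data.values())

per_comparison = mean_per_comparison(effect_sizes)
per_study = mean_per_study(effect_sizes)
print(f"per comparison: {per_comparison:.3f}, per study: {per_study:.3f}")
```

In this invented example the per-comparison mean is roughly twice the per-study mean, solely because District A contributes three of the five comparisons.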

Stephan also omitted the studies by Carrigan and by Thompson & 
Smidchens. However, he also objected to the studies by Iwanicki & Gable 
and Slone on grounds that they dealt with the second year of 
desegregation while other studies dealt with the first year. He further 
objected to Slone because the segregated controls were attending a 
school that was 40% white. This left Stephan with only 15 studies to 
analyze. Since the studies he omitted all tended, with the exception of 
Slone, to have zero or negative effect size estimates, it is clear that 
Stephan's sampling decision disposed his analysis towards a larger 
average effect size than other panelists. 

Wortman differed from the other panelists in two important ways. First, 
he preferred his own selection of 31 "superior" studies to the panel's 
19. However, his analyses of the 31 showed that designs without control 
groups produced higher effect size estimates than designs with control 
groups. Hence, I treat his analyses based on studies with controls 
differently from the analyses without controls because, among other 
possible artifacts, maturation and testing effects can inflate estimates 
of the desegregation effect. Second, in his analyses of the panel's 19 
studies, Wortman was more strict than the others about what he would 
accept as valid information about variances. Since such information is 
crucial for computing effect sizes, he was able to produce estimates that 
also controlled for pretest differences between the desegregated and 
segregated control groups for only 11 of the 19 studies favored by the 
panel. One of these was the study by Carrigan. Omitted were Clark, 
Evans, Iwanicki & Gable, Klein, Laird & Weeks, Slone, Syracuse, and 
Thompson & Smidchens. Since Wortman preferred somewhat different 
standards of methodological adequacy than the panel, I sometimes include 
estimates computed from his analyses of the 11 panel studies, and at 
other times estimates based on the larger subset of his preferred 
studies that involved designs with control groups. These studies should 
overlap heavily with the panel's selection criteria. 

The panelists provided estimates for reading and math combined, for 
reading alone, and for math alone. It is interesting to note that there 
is no obvious relationship between gains in mathematics and reading when 
the desegregated are compared to the segregated. To compute a 
correlation of reading and math gains would not be useful because of the 
small number of studies and comparisons for which there were measures of 
both reading and mathematics gains. However, of Armor's 18 relevant 
comparisons, math and reading gains had the same sign in seven 
instances, different signs in eight, and three instances were 
indeterminate because of zeros. Of Miller's 13 comparisons, seven had 
the same sign and six the opposite; while of Stephan's comparisons there 
were 13 with the same sign, 11 with the opposite, and one was 
indeterminate. Math and reading gains were not clearly related, and 
little is gained by adding them together. Consequently, I prefer to 
present results separately from each knowledge domain. However, for 
purposes of continuity with the panelists some of my reanalyses will 
involve reading and math scores combined. When that happens, my 
analyses, like those of the panelists, weight reading slightly more than 
math because more reports included reading than math measures. 
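
The sign tally described above can be sketched in a few lines of Python. The reading and math gains listed here are hypothetical placeholders, not the panelists' values; a pair is treated as indeterminate whenever either gain is exactly zero, which is my reading of "indeterminate because of zeros."

```python
def sign(value):
    """Return -1, 0, or +1 according to the sign of value."""
    return (value > 0) - (value < 0)

def tally_sign_agreement(pairs):
    """Count (same sign, opposite sign, indeterminate) reading/math pairs."""
    same = opposite = indeterminate = 0
    for reading_gain, math_gain in pairs:
        if reading_gain == 0 or math_gain == 0:
            indeterminate += 1  # a zero gain has no sign to compare
        elif sign(reading_gain) == sign(math_gain):
            same += 1
        else:
            opposite += 1
    return same, opposite, indeterminate

# Hypothetical (reading, math) effect sizes for five comparisons.
gains = [(0.2, 0.1), (0.3, -0.1), (0.0, 0.4), (-0.2, -0.3), (0.1, -0.2)]
tally = tally_sign_agreement(gains)
print(tally)
```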

2. Panelists' Results. Using his own preferred set of studies based on 
a sample of comparisons, Armor obtained an effect size of .06 for 
reading and .01 for math; Miller obtained an effect size of .16 for 
reading and .08 for math; Stephan's values were .15 and .00; while in my 
analysis of Wortman's results for the eleven studies with pretest 
adjustments, the mean effects were .26 and .08. (Wortman's own results 
from the panel's 19 studies were .28 and .23, but this includes studies 
where no pretest adjustments were made. His estimates from his total 
sample of 31 studies were .57 and .33, but these are based on some 
studies without control groups. Thus, I consider both of these last 
sets of estimates to be problematic.) 

If we turn now to estimates of reading and math combined, Armor's 
overall estimate was .04, Stephan's was .14 (but .07 when computed as 
gain per 8-month school year), Miller's was .12, while Wortman's was .17 
derived from the studies of his own choosing that had control groups. 
If one took the panel's estimates at face value they would appear to 
support the following conclusions: 

a. Desegregation did not cause a decrease in the achievement of black 
children. 

b. It probably did not cause an increase in math skills, for the mean 
gains vary from 0 to .08 standard deviation units. 

c. It may have caused an increase in reading skills, for the mean 
gains vary from .06 to .26. 

The range estimate for reading deserves comment, since the upper 
bound comes from our analysis of Wortman's eleven studies where 
pretest adjustments could be made. This is a considerably smaller 
sample than the other authors analyzed, and so should be treated as 
particularly tentative. Omitting it gives a revised range that 
permits a fourth conclusion, which I believe to be better justified 
than the third conclusion immediately above. 

d. The gain in reading was somewhere between .06 and .16 standard 
deviation units. This is between two and six weeks of gain if we 
follow the rule of thumb of Glass et al. (1981) and associate a gain 
of one-tenth of a standard deviation with one month's gain in 
knowledge. 
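
The conversion in conclusion (d) can be checked with a short Python sketch. The rule of one-tenth of a standard deviation per month is the one cited in the text; the figure of four weeks per month is my own simplifying assumption, not something the panelists state.

```python
SD_PER_MONTH = 0.10   # rule of thumb cited in the text: 0.1 SD ~ one month of gain
WEEKS_PER_MONTH = 4   # assumption made here for illustration only

def weeks_of_gain(effect_size):
    """Convert an effect size in SD units into approximate weeks of gain."""
    months = effect_size / SD_PER_MONTH
    return months * WEEKS_PER_MONTH

low_weeks = weeks_of_gain(0.06)
high_weeks = weeks_of_gain(0.16)
print(f"{low_weeks:.1f} to {high_weeks:.1f} weeks")
```

Under these assumptions the range of .06 to .16 standard deviation units corresponds to roughly 2.4 to 6.4 weeks, in line with the "two and six weeks" stated above.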

The small discrepancies between the panelists in mean estimates 
principally reflect differences in (1) the studies included for review; 
(2) the way effect sizes were computed; and (3) a preference for some 
types of control groups over others within a few studies. I shall 
resist the temptation to discuss each of these issues in order to make 
judgments for each of them about the methodological option to be 
preferred, after which point estimates of gains could be computed. 
While such an exercise would result in easily remembered single number 
estimates of reading and math gains, the resulting precision would be 
misplaced. In meta-analysis, varying the assumptions underlying an 
analysis is desirable because it makes heterogeneous those facets of 
research where no "right" answer is available and fallible human 
judgment is required. To attempt to legislate a single "right" way 
either to compute effect sizes or to sample studies would be 
counterproductive so long as none of the analysts is clearly wrong. 
Indeed, the idea of selecting a panel of methodologically sophisticated 
experts with different views on school desegregation is predicated on 
the particular utility that would result if the panel's estimates of 
desegregation's effects converged despite the differences in values and 
methodological predilections of individual panelists. It is more 
reasonable to expect "convergence" as a range than a point. To search 
for the elusive "true" point estimate of effect could involve laborious 
debates about fine points of methodology and substance that might occur 
within a range of estimates that many would think has few practical 
implications. 
Speaking personally, I am impressed by the degree of correspondence 
between the panelists when only the 19 core studies are considered. 
None achieves negative estimates; all achieve larger estimates for 
reading than math; and the largest single difference, between Armor and 
Miller for reading gains, is of a magnitude many would consider 
small, viz., a difference of about one month of gain. 

The convergence is all the more dramatic since, across all dependent 
variables, Krol obtained an estimate of .10 from his own meta-analysis 
of "better" desegregation studies, while a similar estimate resulted 
from Crain & Mahard (1983) when one aggregates across all their 
dependent variables for the randomized experiments and studies with both 
pretest-posttest measurement and control groups of segregated black 
children. Combining math and reading and analyzing only the studies 
preferred by the present panelists, Armor's estimate was .04, Miller's 
was .12, and Wortman's was .17 for all the studies he found with 
pretests and black control groups, while Stephan's estimate was .14 
without his correction for the length of time desegregation had been 
taking place, a correction that none of the other panelists made. The 
average of the panelists' values is .11, only slightly higher than the 
estimate obtained by Krol and Crain & Mahard. (However, as we later 
see, Crain rejects this estimate, preferring to base his judgment on 
studies where desegregation occurs at kindergarten or first grade.) 

3. The Distribution Problem. As a measure of central tendency the 
mean depends on a normal distribution of scores. In Figures 1 through 
4, we present frequency distributions of reading effect sizes for Armor, 
Miller, Stephan, and Wortman based on the studies they chose to analyze. 
(For Wortman we add the math data since he presents reading effect sizes 
for only eleven studies where pretest adjustments were made, and this 
results in a particularly poor estimate of the distribution.) In all 
cases except Miller, the sample sizes are based on comparisons rather 
than studies. But irrespective of the unit of analysis, the 
distributions are visibly skewed, with a disproportionate number of 
effect sizes falling in the upper range. 

Table 1 presents the medians and modes corresponding to the reading 
mean. The median is computed for a sample of both comparisons and 
studies and is defined as the value of the (n+1)/2th case. To compute a 
mode with so few cases, we constructed a scale composed of categories 
with intervals of .10 standard deviation units whose midpoints are 
presented in Figures 1-4. Each effect size was assigned to its 
respective category, with scores of zero being assigned in equal 
proportions to the category 0 to .10 and 0 to -.10. For Miller, no 
value is reported for the median of comparisons since he only provided 
data on studies. Sometimes, no mode is presented for Wortman because 
his smaller sample of studies from the panel's set that had pretest 
adjustments often makes it difficult to determine any modal category 
with more than three cases falling into it. 
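
A minimal Python sketch of the median and the binned mode follows. The effect sizes are hypothetical. The handling of values falling exactly on a category boundary (each category taken to include its lower edge) is my own assumption, since the text does not say how such cases were assigned; `statistics.median` averages the two middle cases when n is even, which is how I read the "(n+1)/2th case" for even samples.

```python
from collections import Counter
from statistics import mean, median

def modal_midpoints(effect_sizes):
    """Return the midpoint(s) of the most populated .10-wide category."""
    counts = Counter()
    for es in effect_sizes:
        hundredths = round(es * 100)       # work in integers to avoid float drift
        if hundredths == 0:
            counts[5 / 100] += 0.5         # zeros split equally between 0 to .10 ...
            counts[-5 / 100] += 0.5        # ... and 0 to -.10
        else:
            lower_edge = (hundredths // 10) * 10   # assumed: lower edge is inclusive
            counts[(lower_edge + 5) / 100] += 1
    top = max(counts.values())
    return sorted(mid for mid, n in counts.items() if n == top)

# Hypothetical, positively skewed sample of eight effect sizes.
sample = [-0.12, -0.03, 0.0, 0.02, 0.04, 0.08, 0.30, 0.70]
sample_mean = mean(sample)
sample_median = median(sample)
sample_mode = modal_midpoints(sample)
print(round(sample_mean, 3), round(sample_median, 3), sample_mode)
```

With a few large positive values in the sample, the mean exceeds the median, and the modal category sits just above zero, which is the pattern the tables below report.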

Table 1 shows that mean effect sizes for reading are larger than median 
effect sizes irrespective of whether the latter are computed as a median 
Table 1 

Central Tendencies for Reading - Authors' Own Preferred Studies 

              Mean    Median of      Median of    Midpoint of Modal 
                      Comparisons    Studies      Category of Comparisons 

Armor         .06     .00            .00          -.05 & +.05 
Miller        .16     --             .06          -.05 & +.05 
Stephan       .14     .08            .08          +.05 
Wortman (a)   .26     .15            .04          -- 

(a) In Wortman's case "preferred" studies refers to those of his selection from the 
panel's core 19 for which pretest adjustments could be made. It does not refer 
to his analysis of 31 studies. 
of comparisons or of studies. It also shows that the mode is smaller 
than the other measures of central tendency and hovers around zero. 
Indeed, the mean of the mean effect sizes across all four panelists is 
.15, the mean median of comparisons is .08, the mean median of studies 
is .05, while the modal categories are of effects between +.05 and -.05. 
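
These column averages can be reproduced from the Table 1 entries with a short Python sketch. The values are those printed in Table 1; the variable names are mine, and Miller contributes no median of comparisons because none is reported for him.

```python
from statistics import mean

# Table 1 entries for reading, in the order Armor, Miller, Stephan, Wortman.
means = [0.06, 0.16, 0.14, 0.26]
medians_of_comparisons = [0.00, 0.08, 0.15]   # no value reported for Miller
medians_of_studies = [0.00, 0.06, 0.08, 0.04]

mean_of_means = mean(means)
mean_median_comparisons = mean(medians_of_comparisons)
mean_median_studies = mean(medians_of_studies)

print(f"{mean_of_means:.3f} {mean_median_comparisons:.3f} {mean_median_studies:.3f}")
```

The three averages come to about .155, .077, and .045, which agree with the .15, .08, and .05 quoted above to within rounding.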

Table 1 was recomputed based on the 17 core studies most panelists 
agreed upon. That is, Thompson & Smidchens was omitted since three of 
the four panelists who did meta-analyses questioned it; and Carrigan was 
omitted since at least two of the panelists objected to the questionable 
nature of its "segregated" controls. In computing the data for Armor, 
the missing values for Rentsch were taken from Wortman. Stephan provided 
his own estimates for the studies by Iwanicki & Gable and Slone that 
he preferred to leave out of most of his own analyses. As Table 2 
shows, having a common set of studies reduced the dispersion of mean 
effect size for reading. The range for the panelists shifted from 
.06-.16 to .13-.16. (Wortman is excepted because his analysis is not 
based on the 17 studies, and I did not want to take his six missing 
estimates from other panelists since that would involve estimating 
about 30% of the scores.) However, even with the same 17 studies per 
analyst, the table still shows that medians are lower than means, and 
that modes are lower than medians. 

The corresponding results for math from the authors' own preferred sets 
of studies are in Table 3. Modes could not reasonably be computed due to 
the smaller number of math than reading comparisons. However, the means 
are consistently higher than the medians. 

Combining math and reading allows modes to be computed again and results 
in the same basic relationship between measures of central tendency. 
This is true whether one uses the authors' own sets of preferred studies 
(Table 4) or the common set of 17 (Table 5). The individually preferred 
studies produced a range of mean estimates from .06 to .16, of median 
estimates from .00 to .08, and of mode estimates from -.15 to +.05. 

These differences in central tendency result because the distribution of 
effect sizes is skewed. The skewness means that, if one were willing to 
assume that the present results are applicable to the nation at large 
today (a dangerous assumption), then (1) for any school district that 
desegregates the most reasonable expectation is that there will be no 
effects on black achievement, for the mode suggests that this outcome is 
obtained more often than any other; (2) 50% of the school districts will 
probably raise achievement by about three one-hundredths of a standard 
deviation or more (the average median of studies across the panelists), 
while 50% of them will probably raise it by less than this; but (3) the 
national impact will be to raise the achievement of black children in 
reading by between two and six weeks and to raise achievement in math, 
if at all, by something less than three weeks, the upper range of mean 
estimates. However, (4) a minority of school districts could expect to 
make larger positive gains. Using Miller's reading estimates for the 
moment, larger gains appear to have been obtained by Anderson (.733), 
Baker (.400), Syracuse (.691), and Zdep (.671). In mathematics, the 
outliers were less common but still visible (Anderson .669, Klein .333, 
and Van Every .543). 
Table 2 

Central Tendencies for Reading - 17 Common Core Studies 

              Mean    Median of      Median of        Midpoint of Modal 
                      Comparisons    Studies (e)      Category of Comparisons 

Armor (a)     .13     .03            0                -.05 & +.05 
Miller (b)    .16     --             .06              -.05 & +.05 
Stephan (c)   .13     .07            .0[illegible]    +.05 
Wortman (d)   .26     .15            .04              -- 

(a) Based on N of comparisons; Carrigan and Thompson & Smidchens omitted; 
Rentsch added and given Wortman values. 

(b) Based on N of studies; Carrigan and Thompson & Smidchens omitted. 

(c) Based on N of comparisons; Carrigan and Thompson & Smidchens omitted. 
Thus, Iwanicki & Gable and Slone added. 

(d) Based on N of comparisons. The sample size is considerably smaller than 
with other analysts, since Wortman omitted all instances where the control group 
standard deviation was not specifically given. This resulted in the omission 
of Clark, Evans, Iwanicki & Gable, Klein, Laird & Weeks, Slone, Syracuse, and 
Walberg, as well as Carrigan and Thompson & Smidchens. No mode was 
ascertainable. 

(e) The medians are from Miller's Table 2 for each author based on N of studies rather 
than comparisons. 
Table 3 

Central Tendencies for ES Values in Math - Authors' Own Preferred Studies 

              Mean             Median of      Median of    Midpoint of Modal 
                               Comparisons    Studies      Category of Comparisons 

Armor         .01              -.05           -.06         -- 
Miller        .08              --             .07          -- 
Stephan       .0[illegible]    .02            .02          -- 
Wortman (a)   .03              -.02           -.05         -- 

(a) In Wortman's case "preferred" studies refers to those of his selection from the 
panel's core 19 for which pretest adjustments could be made. It does not refer 
to his analysis of 31 studies. 
Table 4 

Central Tendencies for Reading and Math Combined - Authors' Own Preferred Studies 

              Mean           Median of      Median of    Midpoint of Modal 
                             Comparisons    Studies      Category of Comparisons 

Armor         .06            .00            .00          -.05 
Miller        [illegible]    --             .06          -.15 & +.05 
Stephan (b)   .07            .05            .05          -.05 
Wortman (a)   .16            .08            .01          -.05 

(a) In Wortman's case "preferred" studies refers to those of his selection from the 
panel's core 19 for which pretest adjustments could be made. It does not refer 
to his analysis of 31 studies. 

(b) These are estimates per school year. 
Table 5 

Central Tendencies for Reading and Math - 17 Common Core Studies 

Entries are listed in the order Armor (a), Miller (b), Stephan (c), 
Wortman (d). 

Mean:                                        .08, .12, .07, .16 
Median of Comparisons (two entries only):    .03, .06 
Median of Studies (e):                       0, .06, .06, .01 
Midpoint of Modal Category of Comparisons:   -.05, -.15 & +.05, 
                                             +.[illegible]5, -.05 

(a) Based on N of comparisons; Carrigan and Thompson & Smidchens omitted; 
Rentsch added and given Wortman values. 

(b) Based on N of studies; Carrigan and Thompson & Smidchens omitted. 

(c) Based on N of comparisons; Carrigan and Thompson & Smidchens omitted. 
Thus, Iwanicki & Gable and Slone added. Estimates of effect per school year. 

(d) Based on N of comparisons. The sample size is considerably smaller than with 
other analysts, since Wortman omitted all instances where the control group 
standard deviation was not specifically given. This resulted in the omission 
of Clark, Evans, Iwanicki & Gable, Klein, Laird & Weeks, Slone, Syracuse, and 
Walberg, as well as Carrigan and Thompson & Smidchens. 

(e) The medians are from Miller's Table 2 for each author based on N of studies 
rather than comparisons. 
But Stephan's estimates make the studies with outlying results seem less 
extreme, and some different outliers emerge. He computes effect sizes 
in a way that controls for the length of time children have been under 
study in a desegregated school. When reading effect sizes are computed 
per eight-month school year, the outliers are pulled in because they 
tended to come from studies lasting two or three years. The new values 
are: Anderson (.42), Baker (.13), and Zdep (.66). (Stephan leaves 
Syracuse out of his sample.) For mathematics, the positive outliers now 
become: Anderson (.24), Klein (.[illegible]), and Van Every (.14). Stephan's 
computation of effect sizes leads to less variable and less skewed 
estimates than the other panelists, which is why medians and modes make 
less of a difference to his computations of central tendency than to 
others. But the choice of a measure of central tendency still makes a 
difference in Stephan's estimates, for both reading and reading and math 
combined. 
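
The effect of scaling by length of exposure can be sketched as follows. Stephan's exact formula is not given in this passage, so the simple linear division by the number of eight-month school years is an assumption of mine, and the study values are hypothetical.

```python
MONTHS_PER_SCHOOL_YEAR = 8  # the eight-month school year referred to in the text

def effect_per_school_year(effect_size, months_desegregated):
    """Scale a cumulative effect size to one school year (assumed linear)."""
    school_years = months_desegregated / MONTHS_PER_SCHOOL_YEAR
    return effect_size / school_years

# Hypothetical study: cumulative effect of .60 after 24 months (three school years).
per_year = effect_per_school_year(0.60, 24)
# Hypothetical one-year study: the estimate is unchanged by the scaling.
one_year = effect_per_school_year(0.10, 8)
print(round(per_year, 2), round(one_year, 2))
```

Under this assumption a long-running study with a large cumulative effect is pulled toward the center of the distribution, while one-year studies are unaffected, which is consistent with the reduced skew described above.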

However, Stephan's work does present a puzzle. He is the sole panelist 
to compute a median, and about midway in his report he mentions that the 
median gain in verbal achievement (reading) is .13. (His corresponding 
means were .17 for the sample of comparisons and .15 for the sample of 
studies.) I have examined Stephan's effect sizes from his Table 1 and 
have been unable to arrive at the same value. My own estimate based on 
a sample of comparisons and omitting the studies he leaves out is .08. 
Readers should scrutinize Stephan's Table 1 and estimate for themselves 
the effect size for reading scores above which 50% of the effect sizes 
fall and below which 50% fall. 

4. The Confidence Problem. Our reanalysis of the panelists' studies 
using multiple measures of central tendency should not be interpreted to 
mean that, in our opinion, desegregation has had no effect on most 
schools. There are two reasons for a low level of confidence in the 
results presented in Tables 1 through 5. First, we do not know the 
underlying distribution of mean effect sizes (however computed) for the 
population of school districts that have already desegregated. It is 
not clear how representative the panel's core set of studies is. 
Second, with so few comparisons and studies, we cannot have much 
confidence in the sample distributions presented in Figures 1-4. A 
dozen new cases could radically alter each of the estimates of central 
tendency. With such a poorly estimated and unstable distribution, it is 
not clear that the mean would remain unchanged even if more cases were 
added from the very same population that the present sample is supposed 
to represent. 

Statistical significance tests are typically used to make inferences 
about the level of confidence one should ascribe to findings. (Because 
of lay misunderstandings of the word "significance," we prefer to talk 
of tests of statistical reliability rather than statistical 
significance.) Walberg has maintained that for measures of math and 
reading combined, none of the estimates obtained by Krol, Crain & Mahard 
and Wortman, King & Bryant reliably differ from zero. In the current 
case, our calculations of reliability indicate that: (1) for Armor, the 
mean estimates for math alone and for reading and math combined do not 
differ from zero, but the estimate for reading does so marginally 
(p is less than .10); (2) for Miller, the estimate for math does not 
reliably differ from zero, but the estimates for reading alone and for 
reading and math combined do so; (3) for Stephan, the effect for math is 
not reliable, while for reading and for math and reading combined, 
conventional levels of statistical reliability are reached irrespective 
of whether the mean is computed with or without correction for the 
length of desegregation; and (4) for Wortman, the effects for reading 
and for reading and math combined both differ from zero even when we 
consider only the small sample of studies with pretest adjustments. 

These statistical tests are themselves partly problematic. In all cases 
except Miller, the analyses are based on a sample of comparisons. But 
since some studies produce more than one estimate of effect size, the 
assumption of independent errors may not be met. This particular 
problem does not occur in Miller's analysis. There, the small sample of
studies increases the dependence on the assumption of a normal
distribution of effect sizes. But as the difference between the various
measures of central tendency indicates, the distribution of effect sizes
may not be normal. Hence, all the statistical test results reported
above (and in Walberg) should be treated with some caution. As they
stand, they suggest that neither the mean reading effect nor the mean
effect for reading and math combined is due to chance. 

However, to complicate matters, it is not likely that the medians and 
modes differ from zero. The standard error of a median is normally set 
at 125% of the value of the standard error of the means from the same 
distribution, reflecting the greater instability of medians. By this 
criterion, no medians reliably differ from zero for reading or for
reading and math combined. No estimate of the reliability of modes is
necessary since they hover so closely around zero. However, the medians
and modes are based on so few cases that estimates could shift radically 
once a dozen new values are added to the distribution. 
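
As a rough illustration of the comparison described above, the following
Python sketch computes a mean and a median, takes the standard error of
the median to be 125% of the standard error of the mean, and forms the
ratio of each estimate to its standard error. The effect sizes are
hypothetical values chosen only to produce a positively skewed
distribution; they are not the panel's data, and the function name is an
assumption made for this example.

```python
import math
import statistics


def central_tendency_report(effect_sizes):
    """Compare the mean and median of a set of effect sizes.

    The standard error of the median is set at 1.25 times the standard
    error of the mean, following the rule of thumb stated in the text.
    """
    n = len(effect_sizes)
    mean = statistics.mean(effect_sizes)
    median = statistics.median(effect_sizes)
    se_mean = statistics.stdev(effect_sizes) / math.sqrt(n)
    se_median = 1.25 * se_mean
    return {
        "mean": mean,
        "median": median,
        "se_mean": se_mean,
        "se_median": se_median,
        # Ratio of each estimate to its own standard error.
        "mean_ratio": mean / se_mean,
        "median_ratio": median / se_median,
    }


# Hypothetical, positively skewed effect sizes (NOT the panel's data).
hypothetical = [-0.10, -0.05, 0.00, 0.00, 0.02, 0.05,
                0.05, 0.08, 0.10, 0.45, 0.60, 0.70]
report = central_tendency_report(hypothetical)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

With a skewed sample of this kind, a few large values pull the mean
above the median, so the mean can stand further from zero in
standard-error units than the median does.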

If the population of effect sizes is indeed skewed, it is not clear 
which measure of central tendency is to be preferred. The mean 
represents national impact at some abstract, aggregate level, and is of
use to those persons and groups most interested in gaining a national
perspective on education and society. The mode represents what should
happen to the typical school, and so may be of most interest to any
school district or judge considering desegregation, especially if the
district in question differs from those where desegregation has produced
large impacts in the past -- characteristics we shall explore below. For
any commentator willing to assume that the distribution of effect sizes
in the population approximates the (unclear) sample distributions we
have obtained, it is important to decide at a high level of
consciousness on the different utilities implicit in different measures
of central tendency.

5. Why Do Some School Districts Show Larger Gains in Reading? The
skewness in the distributions indicates not only that the mean may be a
misleading measure of central tendency, but also that it might be
productive to probe the reasons why some school districts are outliers.
Discovering what they did to achieve larger gains could, for instance,
be used to develop specific guidelines for desegregation plans, which
school districts could then select if they believed they were suitable
using Miller's estimates, was .70, in Beker was .19, and was
.26 in Zdep. (It was .035 for Rentsch in Miller's analysis.)
When a joint criterion is used to define outliers, only
Anderson clearly emerged. Indeed, the three other studies
had negative estimates for math. Pursuing the instability
theme further leads us to note that the second largest
negative outlier for reading (Van Every, -.17) is based on a
desegregated sample of only 20, and the math estimate is
+.54. We are not arguing that desegregation should have
affected both reading and math. We are only suggesting that
we would be more confident of having identified valid outliers
if reading and math gains were correlated among the potential
outliers.

The third methodological issue concerns how effect sizes were
computed. All the panelists are commendably sensitive to the
need to control for differential growth rates between the
nonequivalent desegregated and segregated control groups, and
all go about the task in similar -- but not quite
identical -- ways. The adequacy of statistical adjustment for
selection-maturation depends on many factors, including the
(unknown) true selection difference, the reliability of
measures, the comparability of within-group regression lines,
etc. In meta-analysis, the hope is that, across all the
studies examined, the inevitable imperfections in the analysis
of any one study will even out so that the average bias due to
selection-maturation will be zero. However, there is no
presumption that the bias will be zero in any single study.
Yet in analyzing outlier effect sizes, one has to assume that
the average selection and selection-maturation bias
among the outliers is zero. However, one might easily have
capitalized on chance and have isolated the subset where
adjustment has been the least adequate. Indeed, in four of
the five outlier cases the desegregated children outperformed
the segregated initially, and in the other case the means
were essentially identical.

Thus, the possibility cannot be ruled out that the outliers
reflect: (1) sampling instability due to small sample sizes;
(2) sampling instability that makes high reading gains not
synonymous with general achievement gains; and (3) an
underadjustment for initial group differences in reading
achievement. It is within the limitations afforded by these
three points that I now examine substantive characteristics of
the outliers for reading.

The Characteristics of Outlier School Districts. As
previously discussed, one characteristic of the outlier school
districts on Miller's list is that they evaluated longer
periods of desegregation -- up to three years in some cases.
The relationship between effect sizes and length of
desegregation is not clear due to sampling instability, with
all the panelists who tackled the issue concluding that effect
sizes seem larger in the five studies with two years of
desegregation than in the nine studies with one year of
desegregation. However, estimates seem to be lowest of all in
the three studies with three years of desegregation. Since
two-year studies predominate among the studies with larger
effects in Miller's Table 2, it suggests that effect sizes may
be related to the amount of desegregation that has taken
place.

The predominance of two-year studies among the districts with 
larger effects also leads me to prefer Stephan's estimates for 
defining outlier school districts. But to use his data, I 
averaged his estimates across grades to give a single reading 
mean per study. The outliers fall into two groups: Anderson 
(.49), Syracuse (.58) and Zdep (.66) are in the one, and Klein
(.23) and Rentsch (.22), in the other. Even listing these
outliers raises once again the specter of instability, since 
Klein would not be an outlier for Miller, while Beker would be 
for Miller but not for Stephan! 

Two substantive factors are associated with Stephan's larger 
effect sizes. One factor concerns when desegregation takes 
place. Figure 6 shows effect sizes per eight months of 
desegregation plotted against when desegregation began. The 
latter values are taken from Wortman rather than Stephan, 
since the information about grades in Stephan's Table 1 appears 
to be based on the grade at which desegregation began in some 
cases and on the grade when it ended in others. Figure 6 
shows a clear negatively accelerated decay curve, with larger
effects the earlier the desegregation. None of the panelists
obtained effects of grade on achievement that were as clear
cut as this, probably because they computed linear
relationships, truncated at inappropriate grade levels, did
not adjust effect sizes for the length of desegregation, or
they assessed the grade of children when the study ended.
Figure 6 suggests that at second grade, a gain is obtained of
about .30 standard deviation units per eight-month
year -- though this estimate is based on only four studies
-- that at the third grade the gain is .12 (five studies),
while it is .14 at the fourth grade (based on the nine
studies).
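
The adjustment for length of desegregation can be sketched as a simple
rescaling. The Python function below assumes, purely for illustration,
that an effect size is divided in proportion to the number of months of
desegregation it covers and expressed per eight-month school year; the
panelists' exact procedures may differ, and the function name and
example values are hypothetical.

```python
def effect_per_school_year(effect_size, months_desegregated,
                           months_per_year=8):
    """Rescale an effect size to a per-school-year rate.

    Assumes a linear relation between effect size and time, which is an
    illustrative simplification and not a finding of the panel.
    """
    if months_desegregated <= 0:
        raise ValueError("months_desegregated must be positive")
    return effect_size * months_per_year / months_desegregated


# Hypothetical example: an effect of .30 observed after 16 months
# (two eight-month school years) of desegregation.
rate = effect_per_school_year(0.30, 16)
print(f"effect per eight-month year: {rate:.2f}")
```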

In trying to explain why a small set of school districts 
produced large reading gains that skewed the distribution of 
effect sizes, it is important to probe whether the 
desegregation was voluntary or mandatory. According to 
Crain's report in this volume, all of the school districts I
have identified as positive outliers had voluntary programs.
This is perhaps not surprising, since the programs were
voluntary in 15 of our 19 studies. For reading, only three
school districts showed overall negative effects in Stephan's
analysis: Sheehan & Marcus (-.07), Smith (-.01) and Van Every
(-.12). The first and last of these were mandatory programs.
Of the two other mandatory programs in the panel's sample, the
study by Carrigan was omitted from some analyses but, when
aggregated across grades, it produced a small negative effect.
The other mandatory study produced a trivial gain of .02
across grades (Evans). It is clear, then, that mandatory
programs were not associated with reading gains but that
voluntary programs were.

However, the relationship between effect size and the
voluntary/mandatory nature of desegregation could only be
considered causal for these four cases of mandatory
desegregation if all other interpretations of the relationship
could be ruled out. But two of the studies -- Evans and
Sheehan & Marcus -- were done in Texas, were the only ones to
use the Iowa Test of Basic Skills, and were two of the only
three studies of desegregation activities that began in the
1970's. (The other study with apparent negative outcomes -- Van
Every -- took place in Flint, Michigan, began in 1969, used the
SRA test, and had very small samples.)

Just as it would be wrong to conclude with confidence that
mandatory programs produce no gains in reading, so it would be
wrong to conclude from the panel's core studies that
desegregation beginning in the earlier grades results in
larger positive gains. There are signs of each relationship,
but with only four mandatory programs and four second grade
samples it is inevitable that we have not made heterogeneous
all the sources of irrelevancy that might have produced
spurious results. The reality is that if the sample of
studies is too small to permit a meaningful analysis of
central tendency across 19 studies, it is even less
appropriate for conducting responsible internal analyses to
try to explain why some school districts seem to have achieved
larger effect sizes than others.

This is true, not only of the potential explanatory factors
analyzed above, but also of other factors about which
individual panelists have speculated. Stephan points out that
studies conducted at an earlier date tend to show larger
effects, while Miller suggests that school districts with
larger effects may have introduced enrichment programs at the
time desegregation occurred and may have had smaller
percentages of blacks in the desegregated classrooms. With
the small samples on hand, it is inevitable, first, that no
strong probe of the impact of such moderator variables is
possible; and, second, that many interpretations remain to
explain why some districts achieved particularly large
positive or negative gains.

The points we want to stress are that: (1) the form of the
distribution of effect sizes is not clear either for the
population of school districts that have desegregated or even
for the small sample of districts we have analyzed; (2) there
may be districts that benefitted more from desegregation than
other districts -- but if so, it is not clear whether they are
outliers for irrelevant methodological reasons (small sample
sizes, unstable measures, or initial group achievement
differences not completely adjusted away) or for relevant
substantive reasons; and (3) of the relevant substantive
reasons, several are contenders as explanatory constructs, but
their unique contribution cannot be unconfounded from the
contribution of the other factors. The factors at issue include:
the child's grade at desegregation, the number of years of
desegregation, whether the desegregation is voluntary or
mandatory, the percentage of whites in the class, the
copresence of desegregation and new enrichment programs, and
the year in which desegregation took place.

6. Summary of the Reanalyses. A casual reading of the panelists'
papers leads to the four conclusions mentioned earlier that are based
upon the panel's 19 studies and seem quite consonant with the findings
of prior meta-analyses by Krol and by Crain & Mahard that involved
larger samples. These conclusions are: (1) desegregation does not
decrease the achievement of black children; (2) it probably does not
increase math achievement; (3) it probably raises reading scores; and
(4) the increase in reading scores is somewhere between .06 and .16
standard deviation units, or between about two and six weeks. These last
estimates were computed from 17 studies, about half of which dealt with
a single year of schooling, and then usually the first one after formal
desegregation began.

Our own analyses corroborate the first two of these findings. We
continue to find no evidence that desegregation decreases achievement or
that it increases achievement in math. Our differences involve the
conclusion about reading. The present analysis suggests that whether
there is an effect or not depends on the measure of central tendency
used, with statistically reliable results emerging from mean gains but
not from median or modal gains. The implication of the lower medians or
modes is that the mean differences are found, not so much because the
"average" effect of desegregation on reading is positive but because -- in
the panel's sample at least -- some school districts made atypically large
reading gains that skewed the distribution of effect sizes.

It is therefore difficult to make an estimate of the size of the reading
effect. There is one range estimate for the mean (between .13 and .16
when the same 17 studies from the panel's 19 are used with each
analyst's own effect size computations -- see Table 2), another range
estimate for the median (.00 to .08 irrespective of the samples
used -- see Table 1 or 2), and yet another for the modal effect (between
-.05 and +.05 -- see Tables 1 and 2). Combining the reading and math
effect sizes makes no difference to the conclusion that central tendency
values differ. The estimated means vary between .07 and .16 for the 17
common studies; the study medians vary between .00 and .06; and the mode
falls between -.05 and +.05.

Why do some schools achieve unexpectedly large reading gains? With so
few studies, this question cannot be answered in any definitive way.
There are at most indirect suggestions that such schools may have
desegregated in the 1960's, had voluntary plans, included the earlier
grades in their evaluation design, been studied for longer time periods,
have had a higher percentage of white children in desegregated
classrooms, and may have introduced enrichment programs at the same time
as desegregation. Such variables could have had independent or joint
impacts, and it is inevitable that other variables could be thought of
that should be added to any list of possible explanations of why some
districts gained so much more than others in reading. Among the
possibilities is chance, for it is noteworthy that the outlier studies
had smaller sample sizes and that, with the exception of Anderson, the
districts with the largest gains in reading were not the districts with
the largest gains in math. While it is not necessary for desegregation
to impact on both -- and Stephan gives an ex post facto rationale for why
desegregation should affect reading but not math -- we would be more
confident of having identified valid outliers had there been more of a
consistency in gains between reading and math.

If the present analysis had not taken place, there would have been what
I interpret to be an impressive consistency of results for reading and
math combined. When they defined better studies their own way and
combined all measures and grades, both Krol and Crain & Mahard reached
comparable mean estimates of .10. (For Crain & Mahard, this value is
derived from the combined results of their randomized experiments and
their two longitudinal designs with black segregated controls.) Using
their own preferred set of studies and considering math and reading
only, the present panelists arrived at estimates varying around this.
Armor obtained .04, Miller .12, Stephan .14, and Wortman .17 when his
two strongest designs were weighted and averaged based on part of his
sample of 31 studies. These estimates are generally higher than the
values of Krol and Crain & Mahard, but not by much. Indeed, I suspect
that few commentators would find much of a difference between a gain of
one month and of one and one-half months (.10 versus .15).
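
The conversion used in this comparison can be written out as arithmetic.
The Python sketch below assumes the rule of thumb implied in the text,
that .10 standard deviation units correspond to about one month of
achievement; the function name and the default value are assumptions for
illustration only, and the conversion is approximate.

```python
def effect_size_to_months(effect_size, sd_per_month=0.10):
    """Convert an effect size in standard deviation units to months.

    The default of .10 SD per month is the approximate equivalence
    implied by the text (.10 versus .15 read as one month versus one
    and one-half months); it is not an exact constant.
    """
    if sd_per_month <= 0:
        raise ValueError("sd_per_month must be positive")
    return effect_size / sd_per_month


# Mean estimates quoted in the paragraph above.
for label, es in [("Krol; Crain & Mahard", 0.10), ("Armor", 0.04),
                  ("Miller", 0.12), ("Stephan", 0.14), ("Wortman", 0.17)]:
    print(f"{label}: about {effect_size_to_months(es):.1f} months")
```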

The present analyses have muddled these waters by suggesting that the 
means above are noticeably higher than their corresponding medians or 
modes and by further suggesting that the choice of a measure of central 
tendency depends in part on knowledge of the distribution of effect 
sizes in the population. But with such a small sample, the true
distribution cannot be confidently ascertained. For those who accept my
analyses, I have substituted a low degree of certainty about the effects
of desegregation for the higher degree that used to pertain but that
depended on distributional assumptions which may be wrong. Social
science analyses often increase uncertainty, and this is to be preferred
to a premature certainty about something wrong or misleading. However,
it is even more preferable to reduce quickly new sources of identified
uncertainty. In the present case, this means examining the
distributions obtained by Crain & Mahard (1983) for their better studies
to see if they are skewed.

7. A Comparison of the Present Results with Crain & Mahard. Crain &
Mahard (1983) insist that the effects of desegregation are best assessed
from randomized experiments and from studies where desegregated
schooling begins at kindergarten or grade one so that the child has
never known segregated schooling. When the randomized experiments and
the studies with kindergarten and first grade samples were studied
separately, Crain & Mahard obtained estimates of .30 in each case. They
therefore interpreted this as the best estimate of the effects of
desegregation on the achievement of black children. Such an effect is
moderately large by many of the (arbitrary) standards used for assessing
the effects of educational interventions, as Walberg's essay in this
volume attests. It is certainly a more optimistic value than obtained
in the meta-analyses reviewed here. Hence, we will consider the
estimates of Crain & Mahard in some detail.

It is clear that their estimates decrease to some extent when we
consider medians and modes rather than means. Crain kindly supplied me
with the distribution of effect sizes for the seven comparisons
involving randomized experiments, with Zdep omitted. The mean was .27,
the median .24, and the mode could not be computed. For the
kindergarten and first grade samples evaluated using before-after
designs and black segregated control groups, the mean based on 17
comparisons was .31, and the median and mode were each .26. I do not
know what the mean, median and mode were for all the studies and all the
grades with before-after measures and black controls. Nonetheless, the
data above suggest that the medians and modes do not reduce to zero in
the studies that Crain and Mahard prefer for estimating the effects of
desegregation.

Unfortunately, the results of Crain & Mahard are not easy to interpret
as estimates of generalized causal impact. First, nearly all the
randomized experiments were part of Project Concern and so offer little
comfort as to the generalizability of effects. Also, with so few
degrees of freedom in the analysis of randomized experiments, it is not
likely that the mean effect reliably differs from zero. Second, only
one of the kindergarten and first grade samples of Crain & Mahard was
included in the present panel's sample -- Carrigan -- despite the
specification of both Crain & Mahard and the present panel that
before-after designs and black controls characterized better studies.
This discrepancy in the number of comparisons presumably occurs because
of differences in strategies used to estimate standard deviations
and -- principally -- because Crain & Mahard were willing to accept pretest
measures that the present panel would not accept because it required
that pretest and posttest measures tap into the same conceptual domain.
For understandable reasons, the pretest measures of very young children
tend to reflect "academic readiness" rather than the academic
achievement that is assessed at the posttest. If the usual selection
bias operated and the children attending desegregated schools were more
able or more motivated than their segregated counterparts, then the
reduced pretest-posttest correlation caused by differences between the
readiness and achievement measures would probably result in
overestimating the effects of desegregation in each study (Campbell &
Boruch, 1975). Consequently, it is unlikely that valid estimates of the
effects of desegregation were obtained with the kindergarten and first
grade samples of Crain & Mahard, though the authors have indeed
identified a significant issue. After the first generation of
desegregation in a district, no students enter desegregated schools from
segregated ones -- nearly all begin and end their schooling in
desegregated classes. Consequently, it is of special importance to
learn how desegregation is related to the achievement of very young
children.
The estimate of Crain & Mahard that most closely approximates the work
of the present panel is based on all grade levels, all outcome measures,
before-after designs, and black control groups. As mentioned earlier,
the estimate they obtained was .10, and this is much closer to the
panel's estimate than the probably inflated value of .30 provided by
studies of kindergarten and first grade children for which initial
differences were not well-controlled. However, nothing in the present
panel's work specifically refutes an implicit claim -- in Crain &
Mahard -- that desegregation may have larger impacts at younger grades.
To say that .30 may be inflated is not to say the true value for the
youngest children is .10. The issue of grade differences in effect
sizes has not been solved by either the present panel or Crain & Mahard,
and must remain an issue for further research.



INTERPRETATION

I want now to interpret the meaning of both the absence of gains in
mathematics and the presence of reading gains of between two and six
weeks. To do this, I broach two issues. First, I ask what implications
the findings have for various stakeholder groups, and in so doing I also
explore how generalizable the findings are beyond the 19 studies
examined. Second, I ask what implications this meta-analysis project
has for theories of research synthesis.

1. Stakeholder Analysis

a. Protagonists of School Desegregation. The analyses I have
presented might give some comfort to protagonists of school
desegregation, particularly those who support it for reasons of
equal access, the improvement of race relations, or the
enhancement of self-esteem rather than for reasons of academic
achievement. For such protagonists the crucial finding from
all the analyses of all the scholars is that school
desegregation does not decrease the achievement of black
children. If it did, this would represent an undesirable side
effect of desegregation with which protagonists would probably
have to deal ethically, ideologically, and politically. My
guess is that it is more difficult to argue that a decrease in
achievement is of no consequence than it is to argue that the
absence of an increase is of no consequence. Unintentionally
decreasing achievement would be a worrisome side effect of
desegregation that no protagonist could ignore.

Protagonists of school desegregation can also take some succor
from an as yet imperfectly corroborated trend in the data.
This is that achievement gains may be larger in younger
children who have not had to go through as long a prior
experience in segregated classes. Indeed, one of the major
points in Crain & Mahard -- that we could not independently
test -- is that achievement gains are greatest of all if black
children have never been segregated. This is a very important
point, for many of the advocates of desegregation view it as a
means of providing desegregated -- or preferably, fully
integrated -- education to all children for all of their school
career. From this perspective, the group of children who
start out in segregated schools are not the group of greatest
interest. Of more concern are those who have never been
segregated and will never experience the historically
circumscribed difficulties associated with being among the
very first children to transfer within a desegregated school
district. Such pioneers move into environments that are
novel, not only for them but also for teachers,
administrators, parents and local leaders. Because of the
novelty, more mistakes are likely to occur than is the case at
a later date when new cohorts of children come through the
system, and teachers, administrators and parents should have
benefitted from earlier mistakes. Later cohorts might be
expected to benefit more from desegregation, both because they
have never known segregated schooling and because the school
personnel are more experienced with education in mixed racial
settings.

Protagonists of desegregation might also note that over half
of the studies examined by the present panel involved only one
year of desegregation. Moreover, the typical fall-spring
testing sessions involve less than a complete school year.
Thus, most of the studies involved only a small fraction of
the total time that children experience desegregation,
especially if they enter desegregated schools in the early
grades. Protagonists of school desegregation might wonder if
its full impact has yet been evaluated and they may point to
the larger effects in two-year studies to suggest that the
cumulative impact of desegregation may be much larger than its
first year effect. The major problem with this argument is
that the studies testing three years of desegregation produced
no effects. Consequently, protagonists of desegregation would
have to discredit the three-year studies in order to make the
case that desegregation has not yet been tested at its
presumptively most efficacious. However, it is not difficult
to discredit these studies since they are only three in number
and they undoubtedly differ from the majority of studies in
many ways that are correlated with lower achievement gains.

The Perspectives of Antagonists of School Desegregation. The
present analyses should bring most succor to antagonists of
school desegregation. Where before they would have had to
acknowledge the gains in reading caused by desegregation and
would have had to argue that their practical implications are
trivial -- as Armor has done in his present essay -- antagonists
can now point to analyses which suggest that there have been
no real gains in reading because of desegregation in most
school districts. This involves a shift in the argument -- from
how meaningful the obtained reading gains are considered to
be, to whether there are any gains at all with value worth
debating. But although the medians and modes in Tables 1
through 5 could be used by antagonists of school
desegregation, I have tried to stress how unstable these
estimates are and how much they might be changed by adding
just a dozen more cases to the distribution of effect sizes.



Antagonists of school desegregation can also point to the
opaque trend in the data for mandatory programs to result in
zero effect sizes and for larger effects to be found with
voluntary programs. Few antagonists of desegregation oppose
plans in which local authorities agree to desegregate and
receiving schools voluntarily accept pupils who volunteer to
go to the receiving schools (or whose parents "volunteer" for
them). The objection is to mandatory desegregation which, in
both my analysis and Stephan's, produced no reading or math
gains. (This comparability was achieved despite the fact that
Stephan classified only two of the panel's studies as
mandatory, whereas using the essays in this volume by Crain
and Armor, I classified four as mandatory, although one was
by Carrigan.) However, little confidence can be placed in the
idea that mandatory desegregation plans cause no reading
gains. Given the small number of studies overall, and of
mandatory studies in particular, the mandatory/voluntary
distinction was correlated with the year desegregation took
place, the test used to measure achievement, the region of the
country (two studies were in the Dallas/Ft. Worth area), and
was probably also correlated with many other factors that
would emerge as soon as one examined in detail the specifics
of the mandatory desegregation studies by Sheehan & Marcus,
Evans and Van Every.

Antagonists of school desegregation can also point to the
paucity of clearcut evidence about desegregation plans that
will raise school achievement. Protagonists of school
desegregation, and persons whose job it is to plan the
desegregation effort in a particular community, want to know
what types of desegregation will be effective. They prefer
this specific question to the more global: "How effective is
desegregation in general in raising achievement?" All the
parties concerned with desegregation research realize that
there is no standard desegregation treatment, but many of the
protagonists of desegregation hope to discover a set of
activities that, when implemented in newly desegregated
schools, will raise achievement, among other things. The
present analysis has pointed with little confidence to some
possible elements of effective desegregation plans. But
nothing in the list of elements is new, and after the panel's
reviews, nothing is better "proven" as a causally efficacious
element of desegregation plans than was the case before.
Antagonists can point, therefore, to the saliency the present
review gives to the continuing uncertainty about the elements
of desegregation that enhance achievement. This is not to say
that the present meta-analysis probed all -- or even most -- of the
prospective causal elements, or even that it probed the better
corroborated among them. All we maintain is that it probed
some of them, but failed to make us any more confident that we
know how to put together desegregation plans that will raise
achievement in reading and math.



Persons Planning Desegregation Activities. Irrespective of 
their personal beliefs about the desirability of 
desegregation, mandated or otherwise, there are some groups of 
persons who have to plan desegregation activities. One such 
group consists of judges, civil servants, consultants, and 
school district officials who develop desegregation plans for 
school districts or metropolitan areas. Such persons want to 
know about the types of desegregation plan, or the major 
elements within an overall plan, that will produce the kinds 
of outcomes they most value from desegregation. The present 
panel's work provides nothing of substance to help such 
planners. It might, however, make a minor contribution to 
undermining their morale, for the difference in outcomes 
between the means, medians and modes suggests that the effects 
of their labors on achievement are likely to be minimal, at 
least in the short term and to the extent the backward-looking 
analyses on which this review is based are pertinent to the 
immediate future. 

This last point is crucial. For many theorists of evaluation, 
its function is less to summarize what has happened in the 
past and more to discover what might be effective in the 
future. In this context, it is worth noting that the major 
difficulties with meta-analysis concern the possibility that 
the bias in one direction may be greater than in the other 
across all the studies under review. The panelists dealt 
exhaustively with biases that might lead to false conclusions 
about whether the relationship between desegregation and 
learning gains is causal, but few of them considered biases 
that limit the generalizability of findings and hence their 
presumed utility for planners. In fact, 16 of the 19 studies 
were begun in the 1960's, and only one is later than 1975. 
The dearth of later studies is striking, and Armor's essay 
contains an important paragraph expressing indignation that so 
few evaluations of school desegregation were undertaken in the 
1970's, a decade characterized by so many large-scale 
evaluations in other areas within education. Most of the 19 
studies under examination were dissertations or local efforts 
by the staff of a school district. This may explain why the 
sample sizes are so small, the documentation of desegregation 
activities so meager, and the measurement plan so sparse. 

Another constant bias is obvious. The panel was constrained 
to examine how desegregation impacted on the achievement of 
black children. Yet for most planners, achievement does not 
exist in a vacuum. The utility of the achievement gains 
caused by desegregation can vary in meaning depending on 
whether the desegregation activities in question also reduce 
or widen achievement gaps between blacks and whites, are or 
are not accompanied by an increase or reduction in interracial 
prejudice, are or are not accompanied by white flight, are or 
are not associated with self-esteem gains, are or are not 
associated with community support, are or are not related to 
changes in real estate values, are or are not associated with 
the founding of magnet or lab schools, etc. By examining just 
school desegregation and black achievement, much of the 
interpretative context vital to planners is lost. 

A second group of planners is composed of teachers, both those 
contemplating desegregation and those already teaching in 
desegregated classrooms. In theory, research could be of help 
to them in identifying practices they can implement that will 
improve the functioning and results in classrooms. However, 
the present meta-analytic efforts do not speak to such 
learning needs. The teacher's needs are more micro than 
macro, more concerned with process than outcome, and with 
explanation than descriptive causation. The question on which 
the panel worked is a question that meets the interests of 
central government officials with responsibility for oversight 
more than it meets the interests of those who must plan for 
desegregation in specific school contexts. 

Persons Honestly Seeking To Learn What Desegregation Has 
Accomplished. The panel's papers help those who would 
honestly understand what desegregation has accomplished by 
questioning the utility of so global a label as 
"desegregation." Miller's analysis shows that, after the mean 
effect size is accounted for, more variance remains than is 
due to chance. This suggests that systematic forces have to 
be taken into account over and above whether desegregation 
took place if there is to be any reasonable prediction of 
effect sizes. Elementary consideration of the decentralized 
structure of educational decision-making suggests that 
desegregation plans will differ from location to location and 
that, even where they appear similar on paper, there will be 
local adaptations to suit local conditions. From the 
perspective of someone seeking to learn what desegregation has 
achieved, elementary questions need to be asked: "What does 
desegregation mean?"; "What are the criteria that should be 
used to create clusters of desegregation activities?"; and 
"How well do the different clusters or types of desegregation 
predict differences in achievement outcomes across districts?" 
At present, persons interested in learning about school 
desegregation are more likely to have learned to identify the 
more pertinent questions than they are to have learned answers 
to these questions. 
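
One conventional way to check whether effect sizes vary more than 
chance alone would allow is a weighted homogeneity statistic. The 
sketch below is illustrative only: it is not Miller's computation, the 
function name homogeneity_q is hypothetical, and the effect sizes and 
sampling variances are invented rather than taken from the 19 studies. 
Under homogeneity, Q is expected to be close to its degrees of freedom 
(the number of studies minus one); a much larger value suggests 
systematic differences between studies. 

```python
def homogeneity_q(effects, variances):
    """Return (weighted_mean, Q, df) for a set of study effect sizes.

    Each study is weighted by the inverse of its sampling variance.
    Q sums the weighted squared deviations from the weighted mean.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need two or more studies with matching variances")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    weighted_mean = sum(w * d for w, d in zip(weights, effects)) / total_weight
    q = sum(w * (d - weighted_mean) ** 2 for w, d in zip(weights, effects))
    return weighted_mean, q, len(effects) - 1


# Hypothetical values: four studies near zero and one atypically large gain.
effects = [0.02, 0.05, -0.03, 0.10, 0.65]
variances = [0.01, 0.01, 0.01, 0.01, 0.01]

weighted_mean, q, df = homogeneity_q(effects, variances)
print(f"mean effect = {weighted_mean:.3f}, Q = {q:.2f}, df = {df}")
```

In this invented example Q is many times larger than its degrees of 
freedom, which is the pattern the text describes: the mean alone does 
not account for the spread of results. 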

But there are some persons interested in the effects of 
desegregation, very globally conceived, most of whom are 
government officials with oversight responsibility, 
journalists, or scholars. The present essay may help 
sensitize them to the possibility of considerable differences 
in effects from district to district and to the possibility 
that, across all districts, effects may be highly variable and 
even skewed. The possibility of skewness might present them 
with a problem. Although the mean represents the global 
impact of desegregation painted on a broad national canvas, it 
is of no comfort to judges and school districts contemplating 
desegregation or to teachers worrying about how to handle a 
racially mixed class. For some of these people, the mode is 
more immediately meaningful than the mean. It may be less 
meaningful in the future, of course, if (1) there really are 
outliers, (2) the causes of large gains can be explained, and 
(3) school districts can adopt the causal elements present in 
the schools with large effects. But we do not yet know what 
these elements are. In the absence of such knowledge, the 
differences between the means, medians, and modes highlight 
anew the conflicting information needs of the many groups in 
the national educational system who have a stake in 
desegregation. The differences are most apparent (1) with 
respect to what should be evaluated (desegregation in general, 
a specific type of desegregation plan, the particular plan in 
a particular district, or elements within plans?); and (2) with 
respect to what should be assessed (achievement, school 
discipline, race relations, self-esteem, enrollment figures, 
local tax support for education, local political support for 
desegregation, home values, etc.?). But the differences in 
information needs are also apparent with respect to (3) which 
measure of central tendency is most appropriate. Different 
measures speak more to the interests of some stakeholders than 
others. 
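
The divergence between mean, median, and mode in a skewed 
distribution can be seen in a small numerical sketch. The gains below 
are invented for illustration (they are not the panel's data), and the 
helper name central_tendencies is hypothetical. A few atypically large 
gains pull the mean well above the value that most districts obtain. 

```python
from statistics import mean, median, multimode


def central_tendencies(values):
    """Return the mean, median, and mode(s) of a list of gains."""
    return {
        "mean": mean(values),
        "median": median(values),
        "modes": multimode(values),
    }


# Hypothetical reading gains, in weeks, for nine districts.
# Most cluster near zero; two districts show atypically large gains.
gains_in_weeks = [0, 0, 0, 1, 1, 2, 3, 12, 17]

summary = central_tendencies(gains_in_weeks)
print(summary)
```

A national official might reasonably cite the mean here, while a 
planner in a single district would be better served by the mode, since 
that is the outcome a typical district obtains. 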

2. Theories of Research Synthesis. The present panel represents a 
unique attempt to probe to what extent experts with three different 
presumed commitments would converge on a common answer about how 
desegregation has affected the achievement of black children. Crain and 
Wortman had already concluded in review articles or papers that 
desegregation increased achievement; the opposite conclusion has been 
drawn by Armor and Miller; while Stephan and Walberg had published on 
the issue but had taken more neutral stances, although Walberg has given 
court testimony largely opposed to desegregation. The hope was to 
achieve a common estimate of effect size despite the different 
commitments, based on a theory that the results would be more credible, 
and perhaps even more valid, if they could be replicated across the 
heterogeneity associated with the analysts' prior professional 
commitments. 

In general, the effect sizes for math and reading combined did reflect 
the prior commitment. Highest were those of Wortman (.17), and Crain, 
who stressed the results from his kindergarten and first grade samples 
and from the randomized experiments he studied (.30 for all outcome 
measures combined). The next highest estimate was from Stephan (.14 
without corrections for length of desegregation), and lowest of all was 
Armor (.04). The person least fitting expectations was Miller, whose 
.12 value was intermediate. 

Actually, the theoretical rationale for pluralism of analysts was only 
partially realized, given the decision made before the panel met to 
restrict the meta-analyses to "good" studies and to use Wortman's prior 
work to generate that list. One of the major points in meta-analysis 
where ideology and other commitments enter in is when relevant studies 
are selected for analysis. Panel members were free to suggest studies 
for the core list, and Armor succeeded in having two studies added that 
had negative effect sizes (Sheehan & Marcus, and Walberg). He also made 
a strong and persistent case for excluding Rentsch and including 
Carrigan. But few considered calls were heard to add other studies, 
even though Crain had a list of 93 that he and Mahard considered 
relevant, more than half of which may have been randomized experiments 
or longitudinal designs with segregated black control groups. In 
retrospect, the decision to restrict the selection criteria to a common 
set rather than let the panelists select their own, and the failure to 
assess each of Crain's 93 studies according to the panel's criteria of 
adequate methodology, may have unnecessarily restricted both the sample 
of studies and the heterogeneity in assumptions on which the theory 
behind the use of multiple panelists depends. 

It is not difficult to see why the decision was made to restrict the 
meta-analyses to "better" studies. After all, Krol has found smaller 
estimates with his "better" studies, as also had Wortman, King and 
Bryant. But Crain obtained larger estimates with his "better" studies. 
Obviously, chance differences in the studies available, or differences 
of opinion about what makes better studies, may have contributed to the 
apparent puzzle about whether superior methods were associated with 
larger or smaller effect sizes. Another point is also worth keeping in 
mind. Although one of the rationales for pluralistic panel members was 
the credibility and validity afforded by convergence, a second rationale 
is that divergence in their results might serve to force out the 
differences in assumptions between advocates and opponents of 
desegregation, thereby sharpening the focus for future research. Yet 
the likelihood of such differences being forced out is presumably 
greater the more freedom panelists have to select studies for review. 

Another decision that was made before the panel convened was to use 
meta-analysis. This technique depends most heavily on the assumption 
that the average bias is zero with respect to threats to internal, 
external, construct, statistical conclusion, or any other type of 
validity (Cook & Leviton, 1980). This assumption is usually dealt with 
in either or both of two ways. First, a subsample of studies is 
isolated for which the assumption is made that the bias is zero, and the 
estimate from this sample is then compared to the estimate for the 
remaining subsample where bias might be a problem. If there are no 
differences in the estimates, the conclusion is drawn that the biasing 
force in question has not operated. The second strategy is to assume 
the source of bias away by postulating that the total sample studied is 
heterogeneous with respect to the threat in question. This last 
assumption is more credible the more the sample differs on irrelevancies 
correlated with the major outcomes. 
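
The first strategy, comparing a presumably unbiased subsample with the 
remaining studies, can be sketched as follows. This is a minimal 
illustration under stated assumptions: the effect sizes are invented, 
the function names are hypothetical, and the exact permutation test is 
one possible way of judging the gap between subsamples, not a procedure 
the panelists are reported to have used. 

```python
from itertools import combinations
from statistics import mean


def subsample_gap(trusted, suspect):
    """Mean effect size of the suspect studies minus that of the trusted ones."""
    return mean(suspect) - mean(trusted)


def permutation_p_value(trusted, suspect):
    """Exact two-sided permutation p-value for the gap between subsamples.

    Every way of relabeling len(trusted) studies as 'trusted' is enumerated,
    and the share of relabelings with a gap at least as large as the
    observed one is returned.
    """
    pooled = list(trusted) + list(suspect)
    observed = abs(subsample_gap(trusted, suspect))
    hits = 0
    total = 0
    for chosen in combinations(range(len(pooled)), len(trusted)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        total += 1
        if abs(mean(group_b) - mean(group_a)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical effect sizes: studies assumed free of bias vs. the rest.
trusted = [0.05, 0.10, 0.08]
suspect = [0.30, 0.25, 0.40, 0.35]

gap = subsample_gap(trusted, suspect)
p_value = permutation_p_value(trusted, suspect)
print(f"gap = {gap:.3f}, p = {p_value:.4f}")
```

With so few studies, a comparison of this kind has little power, so 
finding no difference between subsamples is weak evidence that a bias 
is absent. 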

Desegregation research is problematic for the meta-analyst since Wortman 
has shown that studies without control groups might be biased, and few 
analysts are willing to use norms or white children as "control groups." 
The need for control groups entails that few studies will meet minimal 
methodological characteristics. The sample of studies will also tend to 
be highly variable, given the wide range of desegregation activities in 
the decentralized education sector and the wide range of children, 
grades and times studied. Consequently, small samples of possibly 
abnormally variable estimates will be meta-analyzed. It is difficult to 
imagine arriving at confident estimates of distribution and central 
tendencies in this situation; and it is also foolhardy to expect to 
break the data down in multiple ways so as to examine the correspondence 
in estimates across different types of desegregation activities, 
different years when desegregation began, different regions of the 
country, etc. Consequently, to rule out threats one has to rely on 
there being "enough" variability in region, year of study, type of 
activities implemented, etc. But given the small samples, it is not 
easy to be confident of "enough" heterogeneity in conceptual 
irrelevancies, hence the low level of confidence I have placed in most 
of my own conclusions and those of the panelists. 

These meta-analytic endeavors point to another problem with the method 
that overlaps with the problems in using small samples to estimate 
populations that may be complex and highly variable. Once one has 
postulated that a skewed distribution may be present, the guiding 
question becomes the explanatory one: "Why are there outliers?" 
Explanation is not a strong point of meta-analysis. To explain 
presumes that we have measures of the potential explanatory constructs 
for a large sample of studies. Rarely is this the case with 
meta-analyses, for their availability depends (1) on the extensive 
measurement of what is implemented as part of a treatment (in the 
desegregation studies examined, little was available from reports to 
help with this); and (2) on the extensive measurement of causal 
micro-mediating processes. For desegregation and reading, such 
measurement might include, but not be limited to, the assessment of 
dominant language patterns inside and outside of classrooms. But the 
sample size of studies with such measures might be expected to be low 
since the relevant hypothesis about language patterns had not been 
developed when the earlier evaluators did their work. Indeed, the 
theory developed because of their work and the anomalies in the data 
which the work revealed. Since the number of studies with adequate 
measures of potential explanatory variables will often be low in 
meta-analysis for reasons of cost and because of the dynamic, evolving 
nature of theoretical explanatory constructs, meta-analysis will rarely 
result in confident explanation. This was certainly the case in trying 
to explain the outliers in Figures 1 through 4. Many potential 
explanatory forces were isolated, but none of them could be unconfounded 
from each other with the sample sizes and measures on hand. 



Conclusions 

My own reading of the panelists' papers and my own analyses lead me to 
the following conclusions about how school desegregation has influenced 
the academic achievement of black students. The conclusions are based 
on only about 17 studies, and their generalizability is unknown. 

1. Desegregation did not cause any decrease in black achievement. 

2. On the average, desegregation did not cause an increase in 
achievement in mathematics. 

3. Desegregation increased mean reading levels. The gain reliably 
differed from zero and was estimated to be between two and six 
weeks across the studies examined. Only one panelist (Stephan) 
computed the reading effect per 8-month school year. His estimate 
is between five and six weeks of gain per year. But since none of 
the studies involved more than three years of post-desegregation 
research, it is not possible to compute the mean gain over a 
child's total school career in desegregated classrooms. 

4. The median gains were almost always greater than zero but were 
lower than the means and did not reliably differ from zero. The 
modal gains were even less than the median gains and varied around 
zero. 

5. The differences between the means, medians, and modes result 
because the distribution of reading effects appears to be skewed, 
with a disproportionate number of school districts seeming to 
obtain atypically high gains. 

6. Studies with the largest reading gains can be tentatively 
characterized along a number of methodological and substantive 
dimensions, including: small sample sizes, the study of two or 
more years of desegregation, desegregated children who outperformed 
their segregated counterparts even before desegregation began, and 
desegregation that occurred earlier in time, involved younger 
students, was voluntary, had larger percentages of whites per 
school, and was associated with enrichment programs. 

7. None of the above factors can be isolated, singly or in 
combination, as causes of any of the atypically large achievement 
gains in reading that were obtained in some school districts. 

8. The panel examined only 19 studies of desegregation, with most 
panelists rejecting at least two of them on methodological grounds. 
When the results for each study (or each comparison) are plotted 
for reading or mathematics, the distributions are based on so few 
observations that I could not accept the assumption that the 
obtained distributions closely approximate what the underlying 
population distributions are. Because of the small samples and 
apparently non-normal distributions, little confidence should be 
placed in any of the mean results presented earlier. I have little 
confidence that we know much about how desegregation affects 
reading "on the average" and, across the few studies examined, I 
find the variability in effect sizes more striking and less well 
understood than any measure of central tendency. 


The Evidence on Desegregation and Black Achievement 

David J. Armor 
David Armor and Associates 



The debate over the costs and benefits of school desegregation, 
particularly in its mandatory forms, continues unabated today, nearly 30 
years after the fateful Brown decision by the U.S. Supreme Court. No 
issue has been more central to this debate than the question we address 
here: the impact of desegregation on Black student achievement. 

Indeed, it is remarkable that this question remains in controversy 
today, considering the extent of school desegregation over the past 
twenty years and especially given the mandatory methods imposed by the 
courts over the past fifteen years. One wonders how many courts have 
ordered busing, how many agencies have allocated time and money, and how 
many Black parents have willingly sent their children to distant schools 
out of their neighborhoods, on the assumption that desegregation would 
yield academic benefits for Black children. 

Obviously, more is at stake in desegregation policy than the academic 
progress of students. Desegregation is a highly desirable social 
policy regardless of its educational benefits, and many educators and 
parents will and should seek it despite research findings. On the other 
hand, it is one matter to agree that school desegregation is a 
desirable policy and quite another to make it compulsory regardless of 
other considerations. The moral imperatives permitting coercion in 
social policy make it unlikely, in my opinion, that our courts would 
have abandoned the traditional neighborhood school policy in favor of 
mandatory busing without the belief that they were actually benefiting 
the education of Black students. Why else would so many courts hear 
evidence, and so many legal journals publish treatises on this issue? 

Aside from the legal importance of the achievement question, it does 
have immediate relevance to educational policymakers, especially in 
this day of tight budgets. It is beyond dispute that we need programs 
to enhance minority achievement. The key question is, what kinds of 
programs? In recent years significant amounts of time and money have 
been devoted to improving racial balance in schools, justified in part 
by its supposed educational payoffs. Is this resource investment in 
fact yielding a fair return, in terms of improving minority achievement, 
or would other programs have greater impact? In other words, are racial 
balance activities cost-effective when compared to other available 
alternatives? If not, we should re-order our priorities and invest in 
programs that promise to work. 

Finally, the issue of desegregation and Black achievement should have 
more than a passing interest to parents of Black children, who for years 
have borne the heaviest personal cost of desegregation by enduring long 
bus rides, separation from familiar surroundings, and curtailment of 
extracurricular activities. It is quite likely that, over the long run, 
Black parents' support of busing for the purpose of desegregation would 
lessen if desegregation was found to have minimal impact on their 
children's rate of learning. 

For all these reasons, the National Institute of Education must be 
commended for bringing together, for the first time, a representative 
panel of experts to review the evidence and pass judgment on this 
difficult but vital issue. At the same time, more than one observer 
will be surprised at the small number of studies (19 in all) meeting the 
minimal scientific standards established by the panel, and perhaps 
shocked that only three of these studies have been conducted within the 
past ten years, when school desegregation has been at its peak.* It is 
almost as though educational researchers and their funding 
agencies, including NIE, believe that the issue is settled, or no longer 
important. It is clearly an important question, and even a cursory 
review of the available literature shows that it is clearly unsettled. 
Hopefully, this panel will offer a consensus judgment that will finally 
settle the controversy. 

Before turning to the studies selected for review by the NIE panel, I 
will comment briefly on several other comprehensive review efforts. To 
a large extent the approach taken by the panel culminates an 
evolutionary sequence that can be observed in the previous attempts to 
grasp the essential truths in this varied and complex literature. 

PREVIOUS REVIEWS 

Much of the early disagreement over the desegregation and achievement 
issue stemmed from reliance on a single study, or on a small number of 
studies where variation in results and conclusions might be expected 
(e.g., Armor, 1972 and 1973; Pettigrew, 1973). Yet disagreement 
persists even among the comprehensive reviews, all of which investigate 
many of the same studies. 

The first review to encompass a large number of studies was carried out 
by Weinberg (1970). Like his most recent review, Weinberg covers a lot 
of studies but makes little or no attempt to select studies according to 
their methodological adequacy for causal inference (Weinberg, 1977). As 
we shall see, his conclusion that desegregation significantly benefits 
minority achievement was undoubtedly affected by his failure to consider 
a study's scientific rigor. 

The second comprehensive review by St. John (1975) made considerable 
progress over Weinberg. Not only was her study coverage broad, but she 
additionally classified studies according to the research design 
employed, allowing her to observe the relationship between methodology 
and the impact of desegregation. When St. John took design rigor into 
account, she reported that the evidence was mixed, preventing a firm 
conclusion about the benefit of desegregation for Black achievement. A 
later review by Bradley and Bradley (1978) did not expand on the state 
of the art over St. John. They did conclude that methodological flaws 
impaired the entire group of studies, and that nothing could be decided. 
A distinct advance was made in Krol's (1978) review, where he applied 
formal "meta-analysis" to 55 studies, as that phrase has been used by 
Glass (1978) and others. The technique Krol used involved two critical 
steps that are lacking in previous reviews. First, studies were 
screened for minimal methodological adequacy (e.g., appropriate 
treatment condition and quantitative results) and coded as to a variety 
of conditions related to the type of research design and other study 
attributes. Second, achievement test results were converted to 
quantified standardized estimates by taking the ratio of test score 
means to their standard deviations. This allows estimates of the 
magnitude of desegregation effects, as well as the impact of specific 
study characteristics on those effects. 

*Different panelists, including myself, will take methodological 
exception to some of these studies. 

Using this approach Krol concluded that the average effect of 
desegregation on Black achievement is .16 standard deviations, which 
(depending on the type of achievement test) amounts to anywhere between 
1½ to 3 months of progress during an academic year. However, this 
effect was not statistically significant, and the effect for that subset 
of studies with a valid control group was only .10, which again was not 
significant. The major limitation for the Krol study is that the number 
of studies was small, and no adjustment was made for control group 
selection bias; that is, for treatment-control differences prior to 
treatment. Moreover, the way he estimated effects for studies without 
control groups assumed that a control group would experience no gain. 
This is not a tenable assumption for achievement test data, where some 
academic growth is the norm for most students at least through the 10th 
grade. 
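
The consequence of the no-gain assumption can be illustrated with a 
short sketch. It assumes that a standardized estimate is a difference 
between two means divided by a standard deviation, which is the usual 
meta-analytic form; the function name and all test scores are 
hypothetical and are not drawn from Krol's review. 

```python
def standardized_effect(treated_mean, comparison_mean, sd):
    """Difference between two means expressed in standard deviation units."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (treated_mean - comparison_mean) / sd


# Hypothetical test scores; both groups start at the same level.
deseg_pre, deseg_post = 40.0, 46.0      # desegregated students
control_pre, control_post = 40.0, 44.0  # segregated control students
sd = 10.0                               # common standard deviation

# With a control group: compare the two groups at post-test.
with_control = standardized_effect(deseg_post, control_post, sd)

# Without a control group: the whole pre-to-post gain is credited to
# desegregation, which amounts to assuming a control group gains nothing.
no_control = standardized_effect(deseg_post, deseg_pre, sd)

print(f"with control: {with_control:.2f}, without control: {no_control:.2f}")
```

Because the segregated students in this example also gain over the 
year, the estimate that ignores them is three times the estimate that 
uses them. 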

The most recent large-scale review was carried out by Crain and Mahard 
in several stages (1982). The latest version of this review also uses 
the meta-analysis approach, with quantified effect estimates and study 
characteristics coded for some 93 studies. Although the number of 
studies is larger than in Krol's review, Crain and Mahard intentionally 
included studies with weaker design characteristics in order to test the 
impact of design flaws on desegregation effects. Their overall effect 
size mean is .065 standard deviations, which is both negligible and 
non-significant. 

Crain and Mahard do find differential major effects for grade level, 
with an average effect size nearing .3 for students desegregated at the 
kindergarten or 1st grade level, but dropping off markedly to near 0 in 
the 2nd and higher grades. On the basis of this finding, they argue 
that desegregation can have a significant effect on Black achievement, 
providing it starts in or before the 1st grade; it will have little or 
no effect on students starting desegregation in later grades. It is not 
clear from the study whether this effect occurs only at these early 
grade levels, or whether it is cumulative. In any event, there are some 
further methodological problems with this conclusion. It appears, for 
example, that none of the studies which have tested kindergarten and 1st 
graders have been adjusted for possible selection bias, which continues 
to be a major problem in this field. We will take this issue up once 
again in our concluding section, after reviewing the NIE studies. 

NIE STUDY PROCEDURES 

It is clear from the foregoing review that there is still disagreement 
among the experts about the effect of desegregation on Black 
achievement. The purpose of the NIE panel is to establish 
methodological guidelines for selection of studies, to review the 
studies so selected, and to decide what these studies say about the 
effect of desegregation on Black achievement. I will comment briefly on 
these guidelines, leaving their major exposition in the capable hands of 
Dr. Wortman. 

Study Selection Guidelines 

The major reason for variations in conclusion of major reviewers is that 
they are looking at different sets of studies, which vary greatly as to 
their adequacy for making a causal inference. By establishing "minimum" 
standards for selecting studies, the NIE panel does not mean that the 
resulting set is "pure." Indeed, there may be no such studies in 
existence. The very nature of the process being studied prevents the 
ideal experiment, where one can eliminate all confounding factors but 
the factor being tested. It is believed, however, that studies selected 
according to these guidelines have the best chance for arriving at a 
decision about whether desegregation itself, and not other factors, was 
responsible for changes, if any, in Black achievement. 

For example, the guidelines exclude cross-sectional studies, because 
they do not allow determination of whether desegregated students have 
actually gained on the achievement test in question compared to 
segregated students, or whether differences simply reflect prior 
differences between segregated and desegregated students that persist 
over time. Likewise, longitudinal (over-time) studies without a control 
group of some kind are also excluded since some academic growth can be 
expected of nearly all students during their school career, regardless 
of desegregation experiences. A segregated control group is necessary 
if one wishes to conclude that desegregated Black students have gained 
or lost in comparison to Black students who remained in segregated 
schools. 

Thus, in addition to the usual requirements of quantifiability, 
relevance, and so forth, all selected studies fulfill a basic 
quasi-experimental design, with pre- and post-tests as well as a 
segregated control group (where segregation is defined as 50 percent or 
more Black). We do not imply, however, that there are no further 
methodological problems. Only one of the studies selected is a 
randomized experiment and therefore the control group is not generally 
equivalent to the treatment group prior to the start of desegregation. 
Wortman's preliminary analysis shows that the correlation of pre-test 
and post-test effect sizes is .74. This condition raises a serious 
threat to causal inference, because, just as in a cross-sectional 
study, any observed differences between desegregated and segregated 
students after desegregation could simply reflect pre-existing 
differences between the treatment and control groups. 

Fortunately, the selection criteria also require pre-test means to 
ensure that adjustments can be made to remove the pre-treatment effects. 
As we shall see, adjusting the control groups for initial differences 
has a significant impact on one's conclusions from these 19 studies. 

I disagree somewhat with two of the guideline provisions. First, the 
adjustment method to be described in the next section is not infallible 
and is itself based on a number of assumptions. While it probably works 
well for modest pre-test differences, there is no guarantee that it 
corrects properly for gross differences between treatment and control 
groups, say those approaching or exceeding one standard deviation. 
Since researchers are reluctant to compare the growth patterns of white 
and Black students precisely because their differences approach this 
magnitude, I question whether it makes sense to compare two groups of 
Black students who exhibit similar differences. 

Second, the guidelines do not require equivalent pre- and post-tests,
but only that the content is similar and that the same test is used for
both treatment and control groups. For example, SRA reading might be
used as the pre-test and Iowa reading as the post-test. Although one
can convert each test score to a standardized score, using that test's
standard deviation, this converted mean still reflects test content,
thereby preventing us from establishing that the treated group actually
changed on the criterion in question. Moreover, if this issue is
combined with substantial pre-test differences, it is quite possible
that spurious effects can arise (e.g., high-achieving Black students can
show greater relative gain from the CTBS at time 1 to the Stanford at
time 2 than low-achieving Black students, and more than high-achievers
would show from CTBS at time 1 to CTBS at time 2).

Fortunately, only one study (Rentsch, 1967) embodies both features and,
accordingly, I have excluded it from the review in the next section. I
have also excluded the Thompson and Smidchens (1979) study on two
grounds: its segregated control group averages only 42 percent Black,
which means it is not segregated by the 50 percent criterion, and no
pre- or post-standard deviations are available for the purpose of
computing a standardized effect estimate. A sensitivity analysis is
shown in the discussion section to test the impact of these exclusions
on my results.

Analysis Procedures 

The fact that pre-test differences have a high positive correlation with 
post-test differences in the studies being reviewed makes it imperative 
to adjust post-test scores for pre-test differences. If this is not 
done, then desegregation effect estimates will be biased by pre-existing 
differences between segregated and desegregated students. 

In general, I have followed the procedures outlined by Wortman (1982), 
with several refinements which are described here. Ideally, what one 
would like to have is a population standard deviation for each grade and 
test, so that truly standardized means could be calculated independent 
of sample variations. Unfortunately, this information is not readily 
available, and it is not available at all if one wishes to use estimates 
for Black populations alone. Therefore, sample estimates of standard 
deviations must be used for calculating adjusted effect estimates.

My procedure differs from Wortman's only in the fact that I pooled
standard deviations wherever possible to improve the reliability of the
standard deviation estimate. If the data show an apparent fan-spread
effect, indicated by higher post-test standard deviations than pre-test
standard deviations, then standardized effects were computed separately
for time 1 and time 2 means using pooled standard deviations for each
time. If no fan spread was apparent, then all standard deviations were
pooled for the estimate.
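
The adjustment just described can be illustrated with a minimal sketch.
The function names, the data layout, and the degrees-of-freedom
weighting used for pooling are assumptions made for illustration only
(the text does not state which pooling formula was applied), and the
numbers in the usage note are invented.

```python
import math


def pooled_sd(groups):
    """Pool sample standard deviations.

    groups: list of (sd, n) pairs. Weighting by degrees of freedom
    (n - 1) is an assumption; the source does not specify the formula.
    """
    num = sum((n - 1) * sd ** 2 for sd, n in groups)
    den = sum(n - 1 for _, n in groups)
    return math.sqrt(num / den)


def adjusted_effect(pre, post, fan_spread):
    """Pre-test-adjusted desegregation effect in standard deviation units.

    pre, post: dicts with keys "deseg" and "seg", each a (mean, sd, n)
    tuple. fan_spread: True when post-test standard deviations are
    judged to be larger than pre-test ones (the cutoff is left to the
    analyst, as in the text).
    """
    groups = ("deseg", "seg")
    if fan_spread:
        # Standardize time 1 and time 2 differences separately, each
        # with the standard deviation pooled for that time only.
        sd_pre = pooled_sd([(pre[g][1], pre[g][2]) for g in groups])
        sd_post = pooled_sd([(post[g][1], post[g][2]) for g in groups])
        d_pre = (pre["deseg"][0] - pre["seg"][0]) / sd_pre
        d_post = (post["deseg"][0] - post["seg"][0]) / sd_post
        return d_post - d_pre
    # No fan spread: pool every available standard deviation and divide
    # the difference in gain scores by that single estimate.
    sd_all = pooled_sd([(t[g][1], t[g][2]) for t in (pre, post) for g in groups])
    gain_deseg = post["deseg"][0] - pre["deseg"][0]
    gain_seg = post["seg"][0] - pre["seg"][0]
    return (gain_deseg - gain_seg) / sd_all
```

With invented pre-test values of (50, 10, 30) for the desegregated
group and (48, 10, 30) for the controls, and post-test values of
(58, 10, 30) and (54, 10, 30), both branches give an adjusted effect of
0.2, because the standard deviations do not change between time 1 and
time 2. The two branches diverge only when the spread of scores grows
over time.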

Moreover, I made estimates even where some or all sample standard
deviations were missing. If only pre- or post-test standard deviations
were available, then they were pooled for the population estimate. In a
couple of instances I used standard deviation estimates from other
studies in our NIE set, providing they were based on the same test. The
advantage of this approach is that a greater number of adjusted effect
estimates are available than in Wortman's approach. This analysis
feature is fairly critical, since many otherwise excellent studies in
our set have all of the design requirements and the pre- and post-test
means, but lack only standard deviation estimates (sometimes from only
one time period). It seems improper to exclude such studies from effect
size means when other standard deviation information can be used to
provide reasonable approximations.

Other less important analytic issues will be raised in the 
study-by-study discussion, to which we now turn.

REVIEW OF THE STUDIES 

A summary of desegregation effects on Black achievement from each of 17
studies reviewed is tabulated in Table 1. More detailed information,
including pre-test means, gain scores, and pooled standard deviations,
is shown in an appendix table, along with Wortman's effect estimates
(which are very close to mine in most instances where he computes them).
Table 1 also shows the results of significance testing carried out by
each study's author, denoted by an asterisk next to the effect estimate
if it is significant at the .05 level.

Anderson 

The first study in the group, a voluntary transfer plan in Nashville,
shows the largest effect sizes of the studies reviewed, for both math
and reading. It is not only statistically significant (by the author's
test), but educationally large as well, with reading gains nearing 1
standard deviation. Note that the study has converted test scores into
T-scores relative to each grade level, so that decreases in the means
are not inconsistent with increases in raw score means. Also, given
this type of standardization, fan spread cannot be detected and so all
sample standard deviations were pooled for the estimate. Since the two
groups were equal on pre-test means, fan spread should not be a problem
in any event.

Beker 

This study evaluates a voluntary transfer plan in the North. Our
analysis differs somewhat from Wortman's (other than in using pooled
standard deviations). Wortman used a control group of Black students who
were accepted for the voluntary transfer plan but who ultimately turned
it down. There was another potential control group of students who were
accepted, but could not be accommodated in the transfer program due to
lack of space. Since this group did not differ to any significant
degree from the "refuser" group, I pooled the two groups to improve N's
and standard deviation reliabilities. Compared to Wortman, this
procedure yielded higher effects for reading but lower effects for math.
The author did not compute a formal test so far as I can discern, but
his discussion implied significant positive effects for 3rd grade
reading, significant negative effects for 2nd grade math, and no other
significant effects.

TABLE 1

SUMMARY OF THE EFFECTS OF DESEGREGATION ON BLACK ACHIEVEMENT

                  Grade Levels Tested(a)   Desegregation Effect Size(b)
Study (Author)    Pre - Post               Reading        Math

Anderson          2S - 4S                  +.69*          +.5[?]*
Beker             2F - 2S                  [illegible]    -.2[?]
                  3F - 3S                  +.17           [illegible]
Bowman(c)         3F - [illegible]         +.03*          +.05
                  3F - 5S                  [illegible]    [illegible]
Carrigan          1S - 2S                  +.13           --
                  2S - 3S                  [?].19         --
                  3S - 4S                  +.21           --
                  4S - 5S                  +.10           --
                  [illegible]              -.11           --
Clark             [illegible]              [illegible]    -.12
Evans             [illegible]              [illegible]    -.12
                  5F - 5S                  +.06*          +.26*
Iwanicki          2S - 3S                   .00           --
                  4S - 5S                   .00           --
                  6S - 7S                   .00           --
Klein             10F - 10S                 .00           -.08
Laird & Weeks     [illegible]              +.5[?]*         .00
                  [illegible]              +.24*          -.1[?]
                  4F - 5[?]                +.19            .00
Savage            9 - 11                   +.15           -.08
Sheehan           4F - 5S                  -.16*          -.21*
Slone             4S - 5S                  +.27           +.47*
Smith             6S - 9S                  -.06           +.13
Syracuse          4F - [illegible]         +.75*          --
                  3F - 4S                   .00           --
Van Every         4F - 6S                  [illegible]    +.51
Walberg           3,4F - 3,4S              -.02           --
                  5,6F - 5,6S              -.21           --
                  7,9F - 7,9S              +.08           --
                  10,12F - 10,12S          -.25           --
Zdep              2F - 2S                  +.53           -.17

* Significant at .05 level or better by author's test.
a S denotes spring, F denotes fall.
b In standard deviation units.
c First entry uses regular segregated control group; second entry uses
  segregated control group with an enriched program.

Bowman

The Bowman study is the only one I have included which uses different
pre- and post-tests (N.Y. State and Iowa, respectively). One reason I
included it was the fact that the pre-test showed only modest
differences between the desegregated and the control groups (about 1/2
standard deviation), and also because it has a second and novel control
group: Black students remaining in a segregated school and classroom but
with an enriched educational program. Interestingly, while there are no
large effects of desegregation compared to the regular controls
(although the author reports a significant t-test for reading), there is
a very large effect (non-significant according to the author) showing
that segregated enriched students gained more than desegregated
students. (In the Appendix all means are divided by their respective
standard deviations, and therefore appear in standardized form.)
Sensitivity analysis shown later evaluates the effect of including or
excluding the segregated-enriched control group.

Carrigan

The Carrigan study evaluates a mandatory "one-way" busing program,
arising from the closure of a predominantly Black school. One might
object to the control group here, because it was just at 50 percent
Black. Nonetheless, it was in an area undergoing transition and does
just barely meet the definition being used here.

Pre-test means are not shown in the Appendix, since Carrigan did not
tabulate them for subjects in the study for both the pre- and post-test
(there were some dropouts and missing data). Given the small N's, such
inconsistencies might bias the standard deviation estimates, so I simply
pooled all standard deviations for a single estimate, which can then be
divided into the gain score for the effect size. Wortman apparently
used the existing pre- and post-standard deviations (with inconsistent
N's), thereby accounting for the variations with my estimates. However,
the estimates averaged across all grades are very close.

Clark 

Clark evaluated a voluntary transfer program in North Carolina. This is
the first study in the NIE set where all design criteria are met except
pre- and post-standard deviations. Presumably because of missing
standard deviations, Wortman analysed the SCAT verbal test, although
even here only a single standard deviation is available. I have chosen
the STEP reading test, although the results are similar to those for the
SCAT. For a pooled standard deviation I have used the estimate from
Savage (see below), whose standard deviation averaged 14 at the 9th grade
level. According to STEP norm tables, the 6th grade standard deviation
should be about 1 point lower than the 9th, but I have used 14 from
Savage as a conservative estimate. Given the small change, a standard
deviation in the 13 to 15 range will not alter the effect estimate. I
also used 14.0 as the standard deviation for the SCAT quantitative test,
although this is probably conservatively high (thereby producing a
smaller negative effect). Fan spread should not be a problem here,
since pre-test means are virtually identical for the two groups.

Evans 

This study evaluates a comprehensive, two-way mandatory program in Ft.
Worth, one of only two such programs in the NIE set. Again, all design
requirements were met except for pre- and post-standard deviations, so
we used those from Sheehan, who assessed Black outcomes at the same
grades in the sister city of Dallas (using the same test). I
interpolated for an estimate of 4th grade Spring and 5th grade Fall. It
should be noted that all standard deviation values here are lower than
those shown for national norms.

Iwanicki and Gable 

This study is the only one of several evaluating Project Concern, a
voluntary program in New Haven, Connecticut, that qualified under the
panel's guidelines. Unfortunately, this study focuses on the second
year of desegregation, so this factor should be taken into account when
interpreting the results. Considering the similarity of the
pre-treatment means at each grade level, however (which reflect the end
of the first year of desegregation), and the fact that the control group
was drawn randomly from a group meeting Project Concern's requirements,
including agreeing to participate when an opening occurs, it appears
there were no first-year effects either.

The study does not include standard deviations, but assuming that Black
students gain anywhere from 1/2 to 1 standard deviation in one year (more
in earlier years), which is the pattern in our data, then the standard
deviations are probably in the 10 - 15 range. This assumption is
consistent with white student means reported by Iwanicki which are
anywhere from 11 to 18 points higher than the Black means. In any
event, since the similarity of pre-test means diminishes the concern for
fan spread, and since the gains are identical for grades 2 and 4, the
effect size for those grades will be 0 regardless of the standard
deviation estimate. For grade 6 we used a conservative effect estimate
of 0, even though the effect would be negative if we had a specific
standard deviation estimate.

Klein

This study of voluntary transfers in the South is one of only two
studies in our set at the high school level. Two control groups were
available, one randomly selected from all-Black high schools and one
matched on I.Q. The latter group was selected, due to clear selection
effects when transferees were compared to the randomly selected
controls. We still have a pre-test difference of 7 points, but it would
be 11 points if the random group was used. Only a single standard
deviation is available from an analysis of variance table, so the
possibility of fan spread cannot be taken into account. However, since
the control group has a lower pre-test mean and since each group gained
the same amount, any fan spread effect should change our 0 effect into a
negative effect, thereby making 0 a conservative estimate.

Laird and Weeks

This Philadelphia study evaluates a voluntary program brought on by
overcrowding in a Black school. Students were bused to one of two white
schools, Day and McCloskey. The Black students bused to Day were highly
biased compared to control students, with both IQ and pre-test means
averaging at or near 1 standard deviation above the controls in grades 4
and 5 (in fact, their IQ's equalled white means in the receiving
schools). Therefore the McCloskey students were selected for analysis.
Since post-test standard deviations differed considerably from pre-test
standard deviations, time-specific effect estimates were derived.

The effects in this study are quite large and significantly positive for
reading at grades 4 and 5, but negligible and non-significant for math
at all grade levels. The authors used matched samples for their
significance tests.

Rentsch 

The results from this two-year evaluation of the volunteer busing
program in Rochester (grades 3, 4, and 5) are excluded from Table 1 on
methodological grounds. First, the pre-test and post-test were
different tests, and the author did not make it clear which tests were
used and when they were administered. Second, pre-test differences
between the desegregated and segregated control groups neared or
exceeded 1 standard deviation. Most devastating of all, information
received after the panel had selected this study revealed that white
students were included in the study, and the selection method used for
the bused students makes it highly likely that the desegregated group
had two to three times as many white students as the control group.
This possibility could explain why the desegregated group had so much
higher pre-test means.

The average reading effect for the three grades in the Rentsch study is
+.50, while the average math effect is -.11. Sensitivity analysis will
show the effect of including or excluding this study on my overall
conclusions.



Savage

This evaluation of a Richmond, Virginia voluntary plan is the
only study in our set to investigate the high school level. Three of
the four standard deviations for reading were about equal and similar to
published norms, but a fourth was 2 1/2 times larger (post-test for
controls) and reflected a possible computational or typing error. These
three standard deviations were pooled for reading; pooling was done
separately for pre- and post-standard deviations for math due to
fan-spread indications.

Sheehan 

This study of the Dallas plan may be especially significant because of
its large N (nearly 2,000 students), a time span of two years, and being
the only other evaluation of comprehensive two-way mandatory busing in
this set. While the negative effect of desegregation is not large here,
the size of the N renders it statistically significant, the only such
negative effect in the set.

Slone 

An example of pairing is illustrated in this New York City evaluation,
although it was implemented in only a few schools. The desegregation
started in Fall, 1964, but the pre-test was given in Spring, 1965, so
this study also represents a test of second year effects. On the other
hand, Slone presents reading tests from Spring, Grade 3 (1964) showing
that the desegregated and segregated groups started out with the same
relative difference in reading achievement (25.5 months vs. 21.5 months)
prior to desegregation. These pre-test differences of about 1/2 standard
deviation would make pre- and post-standard deviations desirable, but
they are not available. Only a single pooled standard deviation is used
for the effect estimate.

Smith 

This Tulsa, Oklahoma study is the only one in the NIE set to study
school desegregation due to residential patterns; it is also one of the
longest-term studies. The desegregated schools have a higher proportion
Black than the other studies, averaging about 42 percent.

Syracuse 

This study evaluated an "open enrollment" busing program in Syracuse,
New York. Matched and unmatched controls were available; only the
matched groups were used here. The control group for the 4th grade
group was drawn from a different school than attended by the bused
students originally. An overall standard deviation estimate was computed
from a t-statistic; since the groups were virtually equal at pre-test,
no fan spread correction is required.

A third grade group bused for two years to another receiving school is
also reported in Table 1, but not analysed by other members of the
panel. This group is of considerable interest because it is longer-term
and, especially, its control group is drawn from the same school as the
bused group. Only gain scores are reported, but the author reports that
the matching was successful and that there were no significant
differences between bused and matched control students. The standard
deviation estimate is borrowed from Beker's 3rd grade Spring and
Smith's 6th grade Spring estimates, but its size is immaterial given the
equality of the gain scores.

Thompson and Smidchens

This study was excluded because the "segregated" control group averaged
only 42 percent Black. Sensitivity analysis will assess the impact of
this exclusion on our final effect estimates.

Van Every 

This is a unique study of school desegregation brought on by a new
housing project located in a predominantly white school attendance zone;
the control group is drawn from a Black segregated school with
socio-economic characteristics comparable to the desegregated group. No
difference between pre- and post-standard deviations was found, so one
pooled estimate was used. Although Van Every reports a non-significant
post-mean difference, there appears to be a calculation error. Both the
reading and math differences appear to be statistically significant.

Walberg

This study evaluates the Boston METCO program, a voluntary
city-to-suburb busing plan like Project Concern. Grades 3 and 4 are
combined, as are 5 and 6, and so on, due to small N's in the control
subjects. No differences between pre- and post-standard deviations were
observed, so over-all pooled estimates are used at each grade level.
Math results are unreported here because of unreadable figures on the
xeroxed copy.

Zdep 

The final study evaluates another voluntary metropolitan plan. The pre-
and post-tests are from the same publisher, but the two different forms
are not directly comparable and hence the raw score "gains" presented in
Table 1 are presented only so the reader can derive post-treatment
means. When converted to standardized "scale" scores from published
norms, the bused group gained 4 more points on reading and lost 2 on
math when compared to the control group (the national standard deviation
of the scale scores is 10). Zdep found one of the largest effects on
reading in the set, but the small N renders it statistically
non-significant.

The Wortman Effects

The Wortman formula always computes effect estimates separately for time
1 and time 2, and uses only the control group standard deviations. One
can see from the Appendix that whenever identical groups and tests are
being assessed, in most cases my estimate agrees closely with Wortman's.
The main discrepancies arise in the Carrigan and Walberg studies, where
absence of pre- and post-means on the same group of persons led me to
use only the gain scores and a pooled standard deviation. Even for
these studies the effect estimate averaged across all grade levels is
very similar. The discrepancy in the Beker study arises because I
combined two groups of segregated students for the control group: those
who "refused" to join the busing program, the group used by Wortman, and
those who accepted but could not be accommodated.

The important difference between the Wortman formula and the approach
used here is the number of effect estimates obtained. By pooling
standard deviations and by estimating standard deviations from other
information, effect estimates are obtained for every study. Even though
a precise standard deviation is not available, in many cases the
treatment-control initial scores and gain scores are so similar that the
effect will be near zero no matter what standard deviation is used.
These near-zero effects can have a significant impact on overall effect
estimate averages.

DISCUSSION 

Although the number of studies in the set reviewed here is not large,
the advantage of the panel's approach is that most studies exhibit
above-average methodology, and most appear to be carefully conducted.
Most important, each study meets reasonable standards for possible
causal inference: a pre-post design with a control group. What is lost
in numbers, then, is gained in design quality, which is essential in
arriving at a sound judgment about the impact of desegregation on Black
achievement.

The studies also exhibit a variety of desegregation settings and types,
although they are weighted more towards voluntary programs than
mandatory, a definite limitation for generalization. On the other hand,
for this reason this set may provide a good test of the hypothesis,
since it is probably the case that voluntary programs offer better
opportunities for positive effects: more support from the community,
self-selection of families most desirous of the experience, and so
forth.

The other major restriction on generalization is that the longest-term
study here is only three years in duration, thereby complicating
inference for desegregation experience spanning the whole school cycle.
Given this panel's search, apparently there are no longer-term studies of
adequate quality.

Taken as a whole, what do these studies tell us about desegregation and 
Black achievement? There are several ways to approach an answer to this 
question. 

First, we can consider the significance tests carried out by the author
of each study. Of the different grades and tests in these studies
that were subjected to statistical analysis, only 11 were found
significant at an acceptable level, and two of these were negative
effects. We would add three more significant results out of 53 possible
if the Rentsch study were to be added to the set. Thus the overwhelming
majority of these studies, taken individually, found no significant
effects of desegregation on Black achievement.

The meta-analysis technique employed by the panel provides a second and
more reliable method that goes beyond this simple counting exercise. We
can arrive at an overall assessment of desegregation's impact by
averaging the size of effects across all studies and grade levels. I
adopted two alternative strategies in computing these overall averages.
First, I computed the average of the effect estimates shown in Table 1,
which reflects a group of studies that differs somewhat from the total
group adopted by the panel. Second, for sensitivity purposes, I
averaged effects for the original set of studies as selected by the
panel. This second set of averages therefore includes results from the
Rentsch study and the Thompson and Smidchens study and excludes the
extra grades I analysed from the Bowman and Syracuse studies.

The average effect sizes are shown in Table 2. For the set of studies I
selected, the average effect is .06 of a standard deviation for reading
and .01 for math. Neither of these two average effect sizes is
significantly different from 0 by statistical test. When we consider
those studies as originally adopted by the panel, the effect for reading
rises to .11 and the math effect falls to 0. The reading effect is
still not significantly different from 0. The average reading effect
size of .11 for the panel's original studies is somewhat smaller than
Wortman's average effect, primarily because of his decision not to
calculate effect estimates for a number of studies with effects near 0
(due to incomplete standard deviation information).

For the sake of discussion, let us assume that the more liberal effect
estimate of .11 for reading held up across a larger number of studies,
so that it would be statistically significant. We must still decide
whether a reading effect of this size would be educationally
significant.

First, we must keep in mind that the unit of measurement here is
variation in Black scores, which is known to be smaller than that for
Black and white students combined, or for national norm data, perhaps on
the order of two-thirds or three-fourths. Therefore, even if one found
an effect of .11 in a larger group of studies, the effect in terms of
national norms is still less than .10, or less than one month of a school
year. Since the achievement differential between Black and white
students averages between 1 and 1.5 standard deviations, an average
effect of .11 for Black reading achievement means that desegregation
alone could close the gap by less than 10 percent.
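
The arithmetic behind this paragraph can be checked with a short
sketch. The function names are hypothetical, and the sketch assumes
that the 1 to 1.5 standard deviation gap is expressed in national-norm
units, which the text implies but does not state outright.

```python
def to_national_units(effect_black_sd, sd_ratio):
    """Rescale an effect from Black-sample SD units to national-norm SD units.

    sd_ratio is the Black standard deviation as a fraction of the
    national one (two-thirds to three-fourths in the text).
    """
    return effect_black_sd * sd_ratio


def share_of_gap(effect_national, gap_in_sd):
    """Fraction of the Black-white achievement gap covered by the effect."""
    return effect_national / gap_in_sd


EFFECT = 0.11                 # average reading effect, original panel studies
SD_RATIOS = (2 / 3, 3 / 4)    # range given in the text
GAPS = (1.0, 1.5)             # gap range given in the text

for ratio in SD_RATIOS:
    national = to_national_units(EFFECT, ratio)
    for gap in GAPS:
        print(f"ratio={ratio:.2f} gap={gap:.1f} "
              f"national effect={national:.3f} "
              f"share of gap={share_of_gap(national, gap):.1%}")
```

Under these assumptions the rescaled effect falls between roughly .07
and .08 of a national standard deviation, and the share of the gap
closed stays below 10 percent for every combination in the stated
ranges.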

Second, such an effect might be educationally significant if it was
cumulative over time; that is, if a Black child gained .11 or one month
of a school year for each year the child was in a desegregated school.
Is there any evidence for such a possibility in this group of studies?
This possibility can be tested to some extent by dividing up studies
according to duration and computing average effects for one-year studies,
two-year studies, and three-year studies. I have carried out this
analysis for reading scores using the panel's original 35 grade levels.
If desegregation effects are cumulative, one should see increasing
effect sizes as the duration of desegregation increases.

TABLE 2

THE AVERAGE EFFECT OF DESEGREGATION ON BLACK ACHIEVEMENT

                              Average Effect Size(a)
Study Grouping                Reading        Math

Table 1 Studies                 .06           .01
  (N)(b)                        (33)          (18)

Original Panel Studies          .11           .00
  (N)                           (35)          (22)

a In fractions of a standard deviation. One-tenth of the
  Black student standard deviation (.10) is equivalent
  to about one month of educational growth as measured by
  most standardized tests.

b Number of grade levels for which the average is computed.

The results for reading are summarized in Table 3. The average effect
is +.04 for one-year studies, +.37 for two-year studies, and -.16 for
three-year studies. While the two-year studies do have larger effects
on the average than one-year studies, the three-year studies show an
average negative effect (due largely to the Van Every study).
Therefore, there is no evidence from these studies, the best
available, that there is any cumulative effect of desegregation. This
conclusion must be qualified, of course, by the fact of the relatively
small number of cases for any given duration period.
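
As a consistency check, the three duration averages can be recombined
into an overall mean weighted by the number of grade levels in each
group. This is only a sketch: the variable names are illustrative, and
the inputs are the rounded figures quoted above.

```python
# (average reading effect, number of grade levels) for each duration,
# taken from the figures quoted in the text.
DURATION_GROUPS = {
    "one year": (0.04, 23),
    "two years": (0.37, 9),
    "three years": (-0.16, 3),
}

total_n = sum(n for _, n in DURATION_GROUPS.values())
weighted_mean = sum(effect * n for effect, n in DURATION_GROUPS.values()) / total_n

print(total_n)                   # number of grade levels across all durations
print(round(weighted_mean, 2))   # overall reading effect implied by the groups
```

The weighted mean comes to about .11 over 35 grade levels, which agrees
with the overall reading average reported for the panel's original set
of studies.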

What about the grade at which children are desegregated? When we
compute average effects by grade level, the studies here reveal average
effects of -.55 for desegregation begun at grade one (one study), .35
for grade 2, and inconsistent effects near zero for other grades. This
set of high-quality studies does not support Crain and Mahard's finding
of large effects for grade 1 (and kindergarten) but no effects for grade
2 and higher grades.

Finally, it is noted that there are several studies with very sizable
reading effects: Anderson, Syracuse, Zdep, one grade from Laird and
Weeks, and two grades from Rentsch. Without these six grades (out of 35
in the set), the reading effect would be near 0. Therefore, even the
overall average reading effect of .11 is not a consistent effect of
desegregation. It would be more accurate to summarize our studies by
saying there are six grades with substantial reading effects ranging
from .5 to .8 and 29 grades with much smaller reading effects that
average out to about 0.
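
The decomposition in this paragraph can be verified with simple
arithmetic. The sketch below assumes, as the text states, that the 29
remaining grades average exactly 0; the variable names are
illustrative.

```python
OVERALL_MEAN = 0.11      # average reading effect across all grades
N_TOTAL = 35             # grade levels in the panel's original set
N_LARGE = 6              # grades with substantial reading effects
REST_MEAN = 0.0          # assumed average of the other 29 grades

n_rest = N_TOTAL - N_LARGE
# Solve N_TOTAL * OVERALL_MEAN = N_LARGE * x + n_rest * REST_MEAN for x.
implied_large_mean = (OVERALL_MEAN * N_TOTAL - REST_MEAN * n_rest) / N_LARGE

print(n_rest)
print(round(implied_large_mean, 2))
```

The implied average for the six large-effect grades is about .64, which
lies inside the .5 to .8 range quoted above, so the overall figure of
.11 is arithmetically consistent with six sizable effects and 29 that
cancel out.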

No matter how one summarizes these desegregation effects, the conclusion
is inescapable: the very best studies available demonstrate no
significant and consistent effects of desegregation on Black
achievement. There is virtually no effect whatsoever for math
achievement, and for reading achievement the very best that can be said
is that only a handful of grade levels from the 19 best available
studies show substantial positive effects, while the large majority of
grade levels show small and inconsistent effects that average out to
about 0.

The fact that only a small fraction of these studies show substantial
effects, even though all grade levels were desegregated, suggests
strongly that factors other than desegregation are the real causes of
the large achievement gains documented in these studies. We have no way
to investigate what these factors might be, but one hypothesis is that
they are due to unique educational programs available in those few
schools. Indeed, given the much larger effects demonstrated in many
purely academic interventions (see Walberg's paper in this volume for a
discussion of some of these interventions), this hypothesis may be the
only reasonable explanation for the considerable variation observed in
the panel's selected studies.






TABLE 3 

THE EFFECT OF DESEGREGATION ON BLACK READING ACHIEVEMENT, 
BY YEARS OF SEGREGATION (a) 

Length             Average Reading Effect Size 

One year           +.04 (N=23) 
Two years (b)      +.37 (N=9) 
Three years (c)    -.16 (N=3) 

a Using only the original panel studies, including 
  Rentsch and Thompson & Smidchens. 

b Anderson, Laird & Weeks, Rentsch, Savage and Sheehan. 

c Bowman, Smith and Van Every. 
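
As a rough consistency check on the figures above, the following minimal Python sketch recomputes the overall reading effect in two ways: as the N-weighted mean of the three rows of Table 3, and from the summary of six grades with large effects plus 29 grades averaging about 0. The value 0.65 used for the six large effects is an assumption (the text gives only the range .5 to .8), and the variable names are illustrative only.

```python
# Rows of Table 3: (years, average reading effect, number of grade-level samples)
table3 = [(1, 0.04, 23), (2, 0.37, 9), (3, -0.16, 3)]

total_n = sum(n for _, _, n in table3)
weighted_mean = sum(effect * n for _, effect, n in table3) / total_n

# Second check. ASSUMPTION: the six large effects average 0.65,
# the midpoint of the stated .5 to .8 range; the other 29 average 0.
six_large_mean = 0.65
decomposed_mean = (6 * six_large_mean + 29 * 0.0) / 35

print(total_n, round(weighted_mean, 2), round(decomposed_mean, 2))
# prints: 35 0.11 0.11
```

Both routes land on roughly the .11 overall average reported in the text, which suggests the table and the prose summary describe the same 35 grade-level samples.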
IMPLICATIONS FOR POLICY 

Although the findings of each paper in this volume differ to some 
extent, the range of difference is small in comparison with previous 
debates on this issue. With the exception of Crain, all panelists find 
no effects for math achievement, and find that reading effects are 
positive but quite small and not educationally significant in all but a 
few studies. Perhaps a majority of the panel also agrees that the 
average reading effects are considerably smaller than what might be 
expected from special educational interventions. 

What, then, should the policy directions be from this consensus of 
experts? It seems to me there are four audiences whose future actions 
might be influenced by these results. 

The community of educational researchers might justifiably decide that 
enough research has been done on the issue of desegregation and 
achievement, and that their energies and resources should be devoted to 
more fertile pastures. There will be some, of course, who will find 
sufficient flaws in all 19 of these "best" studies to recommend one more 
large-scale, well-funded study to provide a definitive answer. I would 
not quarrel with such a study, but at this point the probability of a 
negative or indeterminate answer (given current knowledge) is high, 
thereby making its cost hard to justify. 

For educational policy makers, I think these results offer an excellent 
opportunity to reconsider priorities for programs designed to enhance 
minority student achievement. Desegregation is simply not a 
cost-effective technique to accomplish this goal. However desirable 
racial balance may be for other purposes, it is not going to reduce the 
achievement differential between white and Black students. It is time 
to solve educational problems with educational solutions, and many 
promising directions are documented in the Walberg paper. 

The courts and civil rights activists should also take note of these 
findings. The studies reviewed here tell us nothing about whether 
segregation caused the Black-white achievement gap, but they do tell us 
that desegregation by itself will not close it to any important degree. 
There is controversy about the role played by achievement issues in the 
original Brown decision, but there is no question that many lower courts 
have been influenced by achievement results when fashioning 
desegregation remedies. One hopes that the results here will relieve 
judges of the misconception that they are benefiting the academic 
progress of minority students by ordering desegregation plans. 

Finally, these findings may offer relief to many Black parents who have 
willingly endured the hardships of cross-town school transfers because 
of the mistaken belief that their children will benefit academically. 
Many will continue to endorse such transfer for other reasons, but many 
others may well be happy to discover that their child can get just as 
good an education in a neighborhood school close to home. 

This does not mean we should abandon desegregation: it remains a goal 
all panel members share. I think it does raise serious questions about 
compulsory desegregation methods such as mandatory busing. There is 
little justification for forcing parents and children into expensive, 
time-consuming cross-town bus rides when there is no educational 
advantage. For those of us who want to pursue the goal of integrated 
education, we should support comprehensive voluntary transfer programs, 
on a metropolitan basis where necessary. It should be made clear to all 
participants, however, that simply changing to schools that are more 
racially balanced than one's neighborhood school is no guarantee of a 
superior education. Indeed, they may be giving up possible advantages 
of special programs in their own school: programs designed specifically 
to enhance education and proven to work. 
THE EFFECT OF DESEGREGATION ON BLACK READING AND MATH ACHIEVEMENT 



Study and        Test and     Desegregated     Segregated       Gain(D) -  Pooled sd  Wortman  Author 
Grade/Year       (N_D/N_S)    Pre X   Gain(D)  Pre X   Gain(S)  Gain(S)    (T1/T2)    Effect   Effect   Test 

Anderson         Metro (T-scores) 
2/60 - 4S/63     (34/34)      44.3    2.3      46.4    4.8      +7.1       8.0        +.89     +.95     + 
MATH:            (34/34)      44.6    3.6      43.8    1.3      +4.9       9.0        +.54     +.53 

Beker            Stanford (GE months for paragraph meaning) 
2F/64 - 2S/65    (25/32)      15.9    6.7      16.3    5.2      +1.5       2.3/6.7             +.23 
3F/64 - 3S/65    (11/28)      24.2    8.5      20.0    5.5      +3.0       6.6/8.9    +.17     -.04 
MATH:            (2[?]/32)    15.6    4.7      16.7    7.1      -2.4       4.3/6.6 

Laird and Weeks  Philadelphia Achievement 
1S/63 - 4F/65    (20/140)     3.7     5.1      4.2     4.0 
3F/63 - 5F/65    (13/140)     7.2     4.2      6.7     2.2 
4F/63 - 6F/65    ([?]0/147)   8.4     4.1      9.1     3.7 

Smith            Stanford (Raw score for paragraph meaning) 

                 SRA (GE months) 
4F/66 - 6S/69    (20/21)      31.6    21.5     29.4 

                 (Raw scores: pre is 12A, post is 23A) 
                 (12/15)      14.5    8.4      16.0    4.5      +3.9       6.9/7.8    +.53     +.65     0 
                 (12/15)      26.3    -1.9     26.3    -1.0     -0.9       6.8/5.4    -.17     -.15     0 

* Estimated from Savage    ** Estimated from Sheehan 
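
The effect columns in this table appear to be the difference in gains divided by the pooled standard deviation. The short Python sketch below checks that reading against the Anderson rows, the only legible rows that report a single pooled sd. The function name is illustrative, and for rows listing two standard deviations (T1/T2) the table does not say which denominator was used, so those rows are not reproduced here.

```python
def gain_effect_size(gain_difference, pooled_sd):
    """Standardized effect: (Gain_D - Gain_S) divided by the pooled sd."""
    return gain_difference / pooled_sd

# Anderson, Metro T-scores: gain difference and pooled sd as printed in the table
anderson_reading = gain_effect_size(7.1, 8.0)
anderson_math = gain_effect_size(4.9, 9.0)

print(round(anderson_reading, 2), round(anderson_math, 2))
# prints: 0.89 0.54
```

These match the Wortman effect entries of +.89 and +.54 for Anderson.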
Is Nineteen Really Better Than Ninety-Three? 
Robert L. Crain 

The Rand Corporation and 
The Center for Social Organization of Schools 
Johns Hopkins University 



In this volume, a group of scholars have come together to assess the 
state of our knowledge about the effects of school desegregation on 
black achievement test scores. The scholars were selected to represent 
a range of personal ideologies. Thus this project should provide a 
near-perfect opportunity to array a group of social scientists along a 
continuum from left to right and demonstrate that the scientific 
conclusions they draw are consonant with their personal politics. Doing 
so would present strong evidence that our worst fear is true: that 
social science is not really science, and government, in employing 
social science, has merely been financing propaganda. Perhaps one can 
draw this conclusion from the panel's work, but I don't think so. 

First, it is not so easy to attach political positions to working social 
scientists. It makes good sense to classify me as a "liberal"; I have 
testified in a number of court cases, and while this has sometimes been 
as a court-appointed expert or on behalf of a school board resisting 
desegregation, it has usually been as an expert called by the plaintiffs 
in a suit trying to bring about desegregation. Other members of this 
panel have testified for school boards resisting desegregation or have 
been called to present the anti-busing position in congressional 
hearings. But in at least two cases putting labels on members of the 
panel is not so easy to do. Paul Wortman was selected as a liberal 
mainly because he had completed a literature review showing positive 
effects of desegregation on black achievement; and Walter Stephan was 
selected as a "neutral" because he is the author of an earlier review 
concluding that there were few positive effects of desegregation. But 
every scientist whose data support a black position is not necessarily a 
liberal, just as every scientist who agreed with Copernicus was not 
anti-Christian. 

It is also not so easy to show a correlation between personal ideology 
and scientific position. It is true that I, the obvious liberal on the 
panel, am the co-author of a literature review (Crain and Mahard, 1982) 
arguing that desegregation seems to raise Black achievement by .3 
standard deviations, a larger estimate than any other member of the 
panel has made; and the panel's most obvious conservative, David Armor, 
has produced the smallest estimated achievement effect of any member of 
the panel. But if political position were dominant here, its effect 
would have to appear in the way the panel selected the 19 studies it 
considered best. Paul Wortman read the studies gathered by Mahard and 
me (1982) and by Krol (1978) and recommended to the panel a group of 31 
studies as being of superior quality; the 18 that the panel chose to 
accept from that offering are in fact only slightly less positive in 
their assessment of desegregation than the ones they declined to use. 
There is little evidence of bias in their choice. It is true that when 
the panel veered from its normal course of using only the data provided 
by Wortman, it did so to add one study which had found a negative effect 
of desegregation and to add additional data strengthening a second study 
in the group of 18 which had found a negative effect. But this is not 
very strong evidence for an ideological interpretation of the actions of 
the authors. Finally, one might simply note that when the liberals, 
Crain and Mahard, reviewed the literature on desegregation, they 
gathered together 93 studies whose mean effect of desegregation on black 
achievement was +.08 standard deviations, pooling reading and math 
effects together; the conservative David Armor reviewed 19 studies and 
found an effect on reading scores of +.11 and on math scores of .00, an 
average of .055. It is hard to believe that approximately 180° of 
political ideology are accurately translated into the selection of two 
samples whose mean treatment effects differ by only .025 standard 
deviations. 
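
The arithmetic behind this comparison is simple enough to spell out. The sketch below only restates the figures quoted in the paragraph; the variable names are illustrative.

```python
crain_mahard_mean = 0.08                 # 93 studies, reading and math pooled
armor_reading, armor_math = 0.11, 0.00   # 19 studies

# Armor's pooled figure is the plain average of his reading and math effects
armor_mean = (armor_reading + armor_math) / 2
difference = crain_mahard_mean - armor_mean

print(round(armor_mean, 3), round(difference, 3))
# prints: 0.055 0.025
```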

Ideology does appear in some of the essays in this volume, including 
this one; but it tends to show up mostly in the conclusions and 
interpretations, in the words rather than the numbers. One reason it 
does not show in the numbers is that it is very difficult for 
contemporary social scientists to disagree about methodology. The 
technique used here for assessing effect size was proposed by Wortman as 
neither a liberal nor a conservative solution; it was accepted by all 
the members of the panel regardless of personal ideology. 

But this is not to say that there are no differences worth noting among 
the panelists, or that these differences have no consequences. There 
is an important division among the members of the panel, but on a 
methodological, not ideological, issue: the question of whether one, in 
reviewing literature, should select only the better studies and 
concentrate on them, or review all the studies one can find. There is 
in this panel a rather neat correlation between the number of studies one 
chooses to look at and the size of the effect of desegregation one 
finds. Crain and Mahard, using 93 studies, conclude that desegregation 
raises black achievement something on the order of 1/4 to 1/3 of a 
standard deviation. Wortman, reviewing 31 studies, concludes that the 
gain is perhaps 1/5 of a standard deviation. The others, using 19 or 
fewer studies, conclude that desegregation raises black achievement by 
perhaps 1/8 of a standard deviation or perhaps less. I would like to 
argue that in this particular case, it is not an accident that the 
number of studies reviewed is related to the conclusions drawn. 

The question of whether one should selectively review literature or 
review all of it has been a subject of considerable debate among 
scientists using what is now called meta-analysis, the computer-assisted 
review of studies of a particular question. At first thought, the 
argument that one should choose the best studies and leave the chaff 
aside seems unquestionably the right answer. Certainly the 
counterargument that one should include all the studies because error is 
a random variable (that with a large enough sample of studies errors 
will cancel themselves out and reveal the truth) seems quite inadequate. 

Selection of the good studies seems like the obvious answer only as long 
as we sleepily think that our task is only to find the competent 
evaluations of a particular program and compute an overall average 
program effectiveness score. Most of the meta-analyses done to date and 
most of the literature reviews discussed by Herbert Walberg in this 
volume are in fact of this type, but there is no reason they must be 
this simple. First, one often wants to know more about a new 
intervention than simply whether it works; we often need to know how 
and why as well. And even if we only want to know whether there is an 
overall treatment effect, there are better ways than throwing away most 
of the research. Suppose there are 100 studies of an innovation. 
Rather than choosing the ten supposedly best studies and computing an 
average effect size, one might include all 100 studies in the review, 
choosing by empirical statistical analysis the 10 best. Alternately, 
one might evaluate all 100 studies and assign different weights, such as 
is done in survey research, to those studies which are particularly weak 
or strong; rather than counting each study equally, one might count the 
particularly weak studies as being only a fraction of the better 
studies. Alternately, one might do as Mahard and I did and construct an 
additive model, assuming that any study which had a particular weakness 
would overpredict or underpredict the treatment effect by a fixed amount 
"x," and then estimate x through some statistical procedure. All three 
of these alternatives are ways of emphasizing the best studies after an 
empirical analysis of all of them. All else equal, of course we would 
prefer to select the best studies from a group through an empirical 
analysis rather than from an a priori judgment. 
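
The third alternative, the additive model, can be illustrated with a minimal Python sketch. It is not the Crain and Mahard procedure, which the text describes only in outline. It assumes a single weakness coded 0/1 and entirely hypothetical effect sizes, in which case the least-squares estimate of x reduces to a difference of group means.

```python
from statistics import mean

# HYPOTHETICAL data: (observed effect size, 1 if the study has the weakness else 0)
studies = [(0.30, 0), (0.25, 0), (0.35, 0),
           (0.05, 1), (0.10, 1), (0.00, 1), (0.15, 1)]

def fit_additive_bias(studies):
    """Fit effect = base + x * has_weakness by least squares.

    With one 0/1 indicator the least-squares solution is
    base = mean effect of the sound studies and
    x = mean(weak studies) - mean(sound studies).
    """
    sound_effects = [effect for effect, flag in studies if flag == 0]
    weak_effects = [effect for effect, flag in studies if flag == 1]
    base = mean(sound_effects)
    x = mean(weak_effects) - base
    return base, x

base, x = fit_additive_bias(studies)

# Remove the estimated bias from each weak study, then average all studies.
adjusted = [effect - x * flag for effect, flag in studies]
print(round(base, 3), round(x, 3), round(mean(adjusted), 3))
```

In this sketch every study contributes to the final average, but the weak studies are corrected by the estimated bias rather than being discarded.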

Viewed this way, the only argument in favor of prior selection is that 
of efficiency. In many cases this can be a convincing argument. With 
limited resources one cannot afford to spend vast amounts of time wading 
through dozens of weak studies in order to gain a modest amount of 
information. Given the short duration of this project, it might have 
been impossible for the panel to review all 100-odd studies of 
desegregation and Black achievement. Perhaps selecting a small group 
was the only workable plan. But this does not mean that it was a good 
plan. 

In this paper we will argue, first, that selection of a small group of 
preferred studies from a pool using criteria chosen in advance of 
examining the studies is in principle a mistake. We will then go on to 
show that in this case, a mistake in principle was also a mistake in 
practice; the panel, in selecting 19 studies from the pool of 100, led 
themselves into a serious error. 

The Theoretical Problems with Prior Selection 

The analogy to weighting in survey research is useful. In surveys, it 
is often the case that particular classes of respondents are especially 
valuable for analysis, and these respondents are oversampled. However, 
the total sample is then no longer representative of the general 
population. The solution is to assign a weight, a multiplier, to each 
of the oversampled cases so that if three times as many cases in one 
particular class are selected, each is treated as only 1/3 of a case in 
the final analysis. The selection of some studies to include in a 
meta-analysis while others are rejected is essentially a decision to 
assign a weight of 1 to some studies and a weight of 0 to all others. 
The simplest way to justify doing so is to divide the studies into a 
small number of discrete categories, arguing that every study in certain 
categories is worth examining while none of the studies in the other 
categories is. Unfortunately, anyone who has read literature such as 
the desegregation-achievement material knows how difficult it would be 
to justify doing this. 
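
To make the weighting analogy concrete, here is a small Python sketch showing that prior selection is just a weighted average in which every weight is 1 or 0, alongside a graded scheme in which weaker studies count as fractions. The effect sizes and weights are hypothetical and chosen only for illustration.

```python
def weighted_mean_effect(effects, weights):
    """Weighted average of study effect sizes."""
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)

# HYPOTHETICAL effect sizes for five studies, strongest design first
effects = [0.30, 0.20, 0.10, 0.00, -0.10]

select_best_two = [1, 1, 0, 0, 0]            # prior selection: weight 1 or 0
graded_quality = [1.0, 1.0, 0.5, 0.5, 0.25]  # weak studies count as a fraction

selected_mean = weighted_mean_effect(effects, select_best_two)
graded_mean = weighted_mean_effect(effects, graded_quality)
print(round(selected_mean, 3), round(graded_mean, 3))
```

The two schemes use the same formula; they differ only in how sharply the line between good and bad studies is drawn.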

If one does not accept the idea that the studies can be neatly divided 
into two discrete categories, one good and one bad, then a more 
systematic approach is to rank the studies by quality, putting the best 
studies at the top of this list and then moving down the list until we 
find an appropriate cut-off point so we can discard studies below a 
certain level of quality. There are several problems with this 
approach. The first is that study quality is a multi-dimensional 
concept; a study which is good in one respect may not be in another. 
Even if studies that are good in one respect tend to be better than 
average in others, how does one choose to rank one study which is very 
good in category A and only moderately good in category B above or below 
another study which is very good in B and only above average in A? 
While I have not attempted a formal proof, I believe that the Arrow 
paradox (1951) can be used to show that such a ranking is impossible 
unless one is willing to assign definite numeric values to, for example, 
the relative merits of increasing the sample size versus using a pretest 
measure of higher reliability. If it is not possible for one person to 
rank the studies unequivocally from best to worst, it is certainly 
impossible for a group of scholars to do so, meaning that one cannot 
expect the readers of a meta-analysis to agree with the author that the 
right decision has been made about study selection. 
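
The difficulty can be seen in the simplest case, a majority cycle of the kind that underlies Arrow's result. The Python sketch below is an illustration rather than a proof: three hypothetical studies are ranked on three hypothetical criteria, and pairwise comparison by majority of criteria yields no consistent ordering.

```python
from itertools import combinations

# HYPOTHETICAL rankings of three studies (best first) on three quality criteria
rankings = {
    "sample size": ["A", "B", "C"],
    "pretest reliability": ["B", "C", "A"],
    "representativeness": ["C", "A", "B"],
}

def majority_prefers(x, y):
    """True if study x ranks above study y on a majority of the criteria."""
    wins = sum(order.index(x) < order.index(y) for order in rankings.values())
    return wins > len(rankings) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"{winner} beats {loser}")
```

Here A beats B and B beats C, yet C beats A, so no single best-to-worst list respects all three criteria unless one assigns explicit numeric trade-offs among them.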

At this point the reader may argue that I am being a bit pedantic; that 
all science is imperfect, and more importantly is dependent on scarce 
resources. With only a certain amount of money and time available, one 
should not spend it rooting through hundreds of useless studies, 
carefully recording all their faults. If one used the weighting 
procedure suggested earlier, one would have to read each study, enter 
its data into the computer, and perhaps compute weights designed, for 
example, to minimize the variance in the overall estimate by assigning 
low weights to classes of studies which have relatively large 
variability in their estimates of treatment effect. Alternately, if one 
uses the algebraic model that Crain and Mahard used, one must run 
regression equations trying to estimate the proper amount to add or 
subtract from the treatment effects generated by studies of a particular 
kind. All of this takes time and money away from the main objective, 
which presumably is to find the best studies and see what they say. 
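
One standard way to choose weights that minimize the variance of the overall estimate is inverse-variance weighting. The sketch below assumes independent, unbiased estimates with known variances, and applies the weights study by study rather than to classes of studies as the paragraph describes; all numbers are hypothetical.

```python
def inverse_variance_mean(effects, variances):
    """Pooled effect using weights 1/variance.

    Returns (pooled estimate, variance of the pooled estimate).
    """
    weights = [1.0 / v for v in variances]
    estimate = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return estimate, 1.0 / sum(weights)

# HYPOTHETICAL: two precise studies and one noisy study
effects = [0.10, 0.20, 0.90]
variances = [0.01, 0.01, 0.25]

estimate, pooled_variance = inverse_variance_mean(effects, variances)
print(round(estimate, 3), round(pooled_variance, 4))
```

The noisy study still contributes, but its large reported effect moves the pooled estimate only slightly.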

It seems to me that the best way to settle this argument is empirically. 
We have here an example of each kind of research. Can we compare them 
and conclude whether the selection of a small number of supposedly 
better studies is a wiser strategy than a brute force analysis of the 
entire literature? 

The Real-World Problems with Prior Selection of Desegregation Studies 

The problem with selecting the best studies of desegregation and black 
achievement is not merely that the multiple criteria which can be used 
for selection are imperfectly correlated; the criteria are in fact 
negatively correlated. The data which Mahard and I assembled on the 93 
studies demonstrate this. Methodologically superior studies presumably 
have larger sample sizes, longitudinal research designs, and evaluate 
situations which more accurately represent the policy being 
investigated. In this case, more recent desegregation plans are more 
interesting to study than earlier desegregation plans because they 
presumably represent contemporary policy more accurately; and the 
students being studied should be students who have experienced 
desegregation from kindergarten or first grade, since that is the way 
desegregation is done in perhaps 95% or more of all desegregation plans 
in the United States. Table 1 shows the intercorrelations among these 
four criteria. 



Table 1: Correlations among Study 
Methodological Attributes 
and Study Outcomes 

                                              Late     Early 
                           Samp.   Longit.    Date     Grade    Effect 
                           Size    Design     Deseg.   Deseg.   Size 

"Quality" 
Sample Size (Large)         --     -.23*      .33*     -.10     -.04 
Longitudinal Design (Yes)  -.23*    --        .03      -.05      .13* 

"Representativeness" 
Date of Deseg. (Later)      .33*    .03       --       -.19*    -.08 
Grade Deseg. began         -.10    -.05      -.19*      --       .24* 
  (at early grade) 

Outcome: Effect Size (+)   -.04     .13*     -.08       .24*     -- 



The correlations are, on the whole, negative. Studies which have large 
sample sizes tend not to be longitudinal. The more recent the 
desegregation plan being studied, the less likely it is that the study 
will be of students who were desegregated at kindergarten or first 
grade. (The latter negative correlation is almost a necessity since a 
brand new desegregation plan has not had time for its youngest students 
to reach an age where they can be easily tested.) If one wants to 
choose the best studies from among this field, there are hard trade-offs 
to be made. 
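
Correlations of this kind can be computed once each study is coded on each attribute. The following Python sketch uses an entirely hypothetical coding of eight studies (it does not reproduce the 93-study data set) to show how a negative correlation between two desirable attributes arises when few studies have both.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# HYPOTHETICAL coding of eight studies: 1 = has the attribute, 0 = does not
large_sample = [1, 1, 1, 1, 0, 0, 0, 0]
longitudinal = [0, 0, 1, 0, 1, 1, 0, 1]

r = pearson_r(large_sample, longitudinal)
print(round(r, 2))
```

With only one of the four large-sample studies also being longitudinal, the correlation is negative, so selecting on one criterion tends to select against the other.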
Only two studies (18%) of students desegregated at kindergarten are 
longitudinal. The reason is obvious: it is difficult to pretest 
students who have not yet learned to read. And neither of these two 
studies was selected by the panel. The second column shows the 
percentage of studies at each grade selected by the panel. Mahard and I 
found a total of twenty studies of desegregated black students with 
desegregation beginning in kindergarten or first grade and which 
contained a segregated black control group. The panel used the data 
from only one of these studies. The remaining nineteen studies were 
discarded, usually because these very young children did not provide 
accurate pretests for longitudinal analysis. Eight of the twenty 
studies we identified used cohort comparison: comparing the scores of 
kindergarten and first grade students after desegregation to the scores 
of the students who had been in kindergarten and first grade the 
preceding year. The panel, making a rather conventional scientific 
decision, had judged these studies to be of inferior quality and 
excluded them. While it is true that in principle a cohort comparison 
is inferior to a longitudinal experimental or quasi-experimental design, 
this is precisely an example of the situation where there are competing 
methodological criteria, and the choice cannot be wisely made in advance 
of looking at the data. In this case a cohort study is superior because 
it enables us to study students who had begun desegregation in first 
grade. 

Estimating the Effect of Desegregation 

The nineteen studies selected by the panel of scientists show an overall 
effect of desegregation on achievement which is slightly more positive 
than the Crain-Mahard larger sample. Whereas we find an average 
desegregation effect in all 93 studies of .08 standard deviations, our 
estimate for the 18 of our studies selected by the panel is 
significantly higher, .16. This is likely the result of discarding 
non-longitudinal studies. If desegregation has a positive effect, then 
it follows, as Wortman notes, that accurately done desegregation studies 
will show a positive effect and the panel's exclusion of technically 
inferior studies should produce a higher estimate of the effect of 
desegregation than our strategy of including every study regardless of 
quality. We arrive at this same conclusion in a different way. By 
coding the different types of research design as a variable for each 
study, we show that technically better research designs are correlated 
with more positive effects of desegregation. As Table 3 indicates, 
studies in which the performance of blacks in desegregated schools is 
compared to performance of whites, or the performance of the testmaker's 
norming sample, often conclude that desegregation has failed to improve 
black achievement. On the other hand, studies which compare 
desegregated blacks to segregated blacks, either in a "cohort" design 
(the segregated blacks are the students in the same grade in the years 
before desegregation), a "cross-sectional" design (with no pretest) or a 
longitudinal design, are twice as likely to show positive as negative 
results; and randomized experiments show positive results eight or nine 
times as often as negative results. 
Table 3: Direction and Size of Treatment Effect, 
by Type of Control Group 

The problem with the research panel's approach is that by excluding 
supposedly inferior studies by one criterion, they have managed to 
exclude most of the experiments and all of the studies (except for 
Carrigan) in which students were desegregated in kindergarten or first 
grade. Figure 1 shows a plot of the effect sizes estimated by Mahard 
and Crain for 28 samples of students in the eighteen evaluations 
selected by the panel. This is shown as a heavy line, which changes to 
a dashed line where it joins dots based only on one or two samples of 
students. 

The effect sizes for the entire group of 295 samples in the 93 studies 
we reviewed are shown as a light solid line. In grades 2 through 5 
(where the bulk of the samples studied by the panel began 
desegregation), our estimates of effect size for the panel's studies are 
considerably higher than our estimate for the larger set of studies. 
The graph also shows, using the letters A and S, the effect size 
estimates for each grade computed by Armor and Stephan. In the range 
from second grade through fifth, their estimates are also generally 
higher than our estimates for our larger sample. Thus we again see 
that the more selective sample shows higher estimates, presumably 
because it has discarded the very weak designs which are biased toward 
underestimating the effects of desegregation. At the same time, the 
other point of this graph is that there are no data points in the 
panel's nineteen studies for kindergarten and only 1 data point for 
first grade. (The one first-grade datum is regrettably the rather 
untrustworthy estimate by Carrigan, which uses a 50% black school for 
its control group). Also shown on the graph is a circle located above 
first grade, at approximately +.30 standard deviations, indicating the 
estimated effect size predicted by our regression equation for a typical 
study of students desegregated at first grade using a randomized 
experimental design. If one were willing to assume that Armor's and 
Stephan's data supported the early grade effect, an extrapolation down 
to grade one from their data would seem consistent with the estimate. 
Unfortunately, given the relatively small number of cases and the rather 
ragged pattern in the data, it is difficult to say whether either 
Stephan's or Armor's calculations support the hypothesis that there are 
stronger effects at lower grade levels. 

The problem is again made more difficult by the prior selection of 
studies which has reduced the number of cases so greatly that it is 
difficult to compute a reliable correlation with the data. The best data 
on the question is the Crain and Mahard analysis. Table 4 presents that 
data, and shows a quite strong pattern. Of 55 studies of students 
desegregated in kindergarten or first grade, 45 (82%) show a positive 
desegregation effect. 
Table 4: Direction and Size of Treatment Effect, 
By Grade at Initial Desegregation 

students, and carried out an analysis of covariance. He does not 
report the actual raw means, but the obtained F of [illegible] suggests that 
there must have been a difference of 1/3 standard deviations favoring 
the experimental group. 
Thomas Mahan (197[?]) was director of the Hartford Project Concern program 
at the time, and conducted his own evaluation. He used data during the 
second year of the project, so that presumably his results are more 
biased by attrition from the original random treatment and control 
group than are Wood's. For the second year of the project, Mahan shows 
an average 9-point increase in IQ for the treatment groups who entered 
the program in the first grade, compared to control group increases of 3 
and 2 points respectively. There are also large differences favoring 
the treatment group for students who entered the program in grades 2 and 
3 and negative treatment effects for students who entered the program in 
grades 4 and 5. Mahan also reports the results of achievement testing 
using the Metropolitan Readiness Test which showed some significant 
differences for the kindergarten group favoring the bused students, and 
also some results from the Primary Mental Abilities Test which showed 
results for both kindergarten and first grade students favoring the 
experimental group. 

Project Concern operated in several cities in Connecticut, and Joseph 
Samuels wrote a dissertation (1971) evaluating the New Haven program. 
He compared 37 students who transferred to the suburbs at kindergarten 
to a control group of 50 students. There are possible biases here, in 
that Samuels's transferred students were apparently screened after being 
randomly selected to drop students who "had medical or psychological 
reasons precluding their involvement." He does not say how many 
students were omitted in this way. In addition, the control group was 
limited to students who remained in the same school for two years, which 
presumably would bias the control group upward. If there were 
differences between the two groups, they do not appear on the Monroe 
Reading Aptitude Test administered to the two groups while in 
kindergarten; the experimental group tested only .03 standard deviations 
higher. Two years later, the treatment group tested 5.5 units higher on 
a reading test with a standard deviation of 12. They also tested 5.6 
units above a group of students in a compensatory education program in 
the city, both differences being significant. The Project Concern 
students did not test higher than the control group in either word 
analysis or mathematics; they were about .25 standard deviations lower 
on both tests. 

Meanwhile, the Rochester city schools carried out a similar
city-to-suburb program (Rock et al., 1968). In each of three years 25
experimental subjects were selected and allowed to transfer to the
suburbs while 25 others were held as a control group in the central
city. The experimental group scored below the control group on the
pretest (the Metropolitan Readiness Test). At the end of the first
year, the treatment students did not score higher on the Metropolitan
Achievement Test, but did score one-half year ahead of the control group
on the SRA battery. The second experimental group also scored below
their control on the Readiness Test, but after one year scored about
three months ahead of the control group. At the end of one year the
third experimental group did not score above control in reading but did
score 6 months ahead of the control group in math. In that year, the
treatment group was slightly superior to the control group on the pretest,
which was the New York State Readiness Test, so this result is
questionable.

None of these five experimental studies were selected by the panel.
Usually the reason is that the pretest and posttest were not the
same. It is nearly impossible to design a study with identical tests
covering the kindergarten-first grade range, since the students cannot
read at the beginning of that period. Tests are notoriously unreliable
for students at this age. In addition, all five of the experimental
designs used analysis of covariance models, and relatively little
information was provided with which to compute effect sizes. Finally,
all five studies have problems with attrition. It is doubtful that the
attrition problems are more severe in these studies than they are in the
longitudinal studies used by the panel; but these studies are usually
more detailed in describing attrition, making it harder to overlook a
problem which is in fact present in the majority of longitudinal studies
of education. In general, we do not think that these studies should be
considered inferior to those chosen by the panel.

There are 8 other studies which use what we call "cohort" comparisons
(and which others often call "historical control groups"). These
studies compared scores of desegregated students in a particular grade
to the scores that blacks made in the same grade before desegregation
occurred. This kind of design is the only way to study desegregation in
a community where all schools have been desegregated, since no
segregated group of black students remains to be used as control. None
of these studies have data for a large number of years which would
enable one to conduct an interrupted time-series analysis. For example,
the Nashville-Davidson County public schools (1979) published mean test
scores for black students in each grade for the nine-year period from
1970, when the desegregation plan was adopted, to 1978. The test scores
show a considerable gain over the period, ranging from .2 to .4 standard
deviations. Of course, the problem is that we cannot attribute this to
desegregation; it may be due to other changes in testing or educational
practice in the city.

One wonders whether a school district would be anxious to publish the 
results if it showed negative effects. Perhaps many other school 
districts have the same sort of data that Nashville has but have not 
released it to interested researchers because it shows declines in 
achievement. But one example which works in the opposite direction is 
from Pasadena, whose school board has been adamantly opposed to
mandatory desegregation and released a lengthy report by Harold Kurtz
(1975) showing the disastrous educational consequences of desegregation
there. In 13 tests of students who were desegregated in grades 2
through 12, scores were lower after desegregation 14 times. But there
were very large achievement increases for students who were in
kindergarten and first grade, averaging .36 standard deviations. Thus
while test scores dropped for black students throughout the district
during the period of time after desegregation, test scores of the very
youngest students went up. This could be a peculiarity of the testing
procedure used with the youngest students, of course.

Cohort analysis is necessary when a district is totally desegregated. 
Total desegregation in the north came first to university communities,
the largest of which was Berkeley, which desegregated in 1968. Test
scores dropped that spring, about .04 standard deviations in reading for
first graders. By 1970, second graders were reading about .16 standard
deviations above the second graders of 1968. Thus one report
(Dambacher, 1971) shows essentially no change in test scores using the
first year of desegregation, while a second paper (Lunemann, 1973) shows
a positive desegregation effect. (In this analysis black and "other,"
presumably Hispanics who did not consider themselves whites, were
combined in one year and separated in others. The percentage of "other"
students in the district changed radically, however, suggesting that
these ethnic classifications were unstable. We have combined "others"
with Blacks for all years in order to avoid this problem.)

Another university town which developed a desegregation plan was 
Evanston. Jayjia Hsia of ETS (1971) carried out a lengthy evaluation,
and found that in the fall of the third grade, two years after
desegregation, students were testing .01 standard deviations below
students two years earlier. She found gains in only 3 out of 9 tests in
the upper grades over the first two years.

Another school district which reported achievement test scores for the 
year after desegregation in comparison to the year before was Clark 
County (Las Vegas), Nevada. Test scores for black students were up .1
years.

In one southern district, George Chenault (1976) found that students who
were desegregated in kindergarten scored .3 years higher in the fourth
grade compared to students five years earlier.

Finally we have constructed a cohort analysis from the data provided by
Patricia Carrigan (1969). The panel treated Carrigan as a longitudinal
study, but the "segregated" control school is 50% black (desegregated by
most people's criteria). We ignored the data for the control school and
instead compared the performance of the desegregated black students to
black students at the sending school prior to desegregation. We found
the integrated students scoring .05 standard deviations higher.

All the cohort studies are subject to alternative interpretations:
change in curricula, in type of test, in test administration, could all
affect test scores. On the other hand, cohort studies have the
advantage of having relatively large sample sizes. They are also not
likely to be affected by complicated statistical procedures which
sometimes do more harm than good. Of eight studies of students
desegregated at kindergarten or first grade, we found gains in 6, the
exceptions being Hsia's Evanston study and Dambacher's Berkeley study,
whose conclusions were reversed the following year by Lunemann.*

The final group of studies of students desegregated at first grade or
kindergarten are longitudinal studies with non-random assignment. These
are generally the most difficult studies to draw conclusions from,
because the inability to use accurate pretests with very young children
makes statistical matching extremely difficult. In the two best
studies, by Louis Anderson (1966) of Nashville's early freedom-of-choice
plan, and Louise Moore (1971) of DeKalb County, GA, the full data was
provided, making it possible for Mahard and me to reanalyze the data. In
both cases we examined student growth during the middle of elementary
school, comparing growth rates for students who had experienced
desegregation from kindergarten or first grade to other students in
segregated schools in earlier years. One study showed a sizeable
increase in the rate of learning while the other study showed a loss
after desegregation. We were reluctant to take either study seriously,
since we are not sure how to relate these two studies of growth rates
several years after desegregation to all the other studies, which
measure growth immediately following desegregation. Five other studies
pretested students at kindergarten or first grade and posttested them
one or two years later. These are usually very brief reports of studies
with relatively small sample sizes.

Orrin Bowman's (1973) dissertation evaluates a voluntary plan in
Rochester, NY. Two experimental groups exceed the controls (both a
regular class and an "enriched" class) by .18 and .32 standard
deviations on a readiness test at grade 1; at grade 3 they exceed the
controls on an achievement battery by .90 and .88 standard deviations.
Bowman's analysis of covariance shows net effects of .75 and .70; using
the panel's procedure, I get effects of .72 and .66. There are only 19
and 17 treatment subjects. Ann Danahy (1971) compared 41 volunteers for
desegregation to a control group randomly chosen from a segregated
school. Little raw data is provided. The author uses regression to
control on the seemingly large pretest differences on the Metropolitan
Readiness Test, and obtains non-significant positive treatment effects.
The technique used overestimates treatment effects, however.

Robert Frary and Thomas Goolsby (1979) compare 32 desegregated first 
graders to 77 in segregated schools, using the Metropolitan Readiness 
Test as a pretest and Metropolitan Achievement Test administered at the
end of first grade as a posttest. There were large differences (on the
order of .7 years) favoring the desegregated students. The pretest data
was used to trichotomize the sample before comparing posttest means
within each group. Elmer Lemke (1979), studying Peoria, Illinois,
studied 180 desegregated and 60 segregated black students five years
after desegregation began. He used the Metropolitan Readiness Test and



*A ninth study, from Jefferson County (Louisville), KY, shows an
increase in black scores in the elementary grades after desegregation.
See Raymond, 1980. We received it too late to include in our review.
the Iowa Test of Basic Skills, and found only one significant positive
effect and no significant negative effects, out of a possible ten
differences; we judged the overall effect as zero. T. G. Wolman (1964)
studied New Rochelle, using the MAT to pretest and posttest desegregated
and segregated elementary school students and the Metropolitan Readiness
Test to pretest and posttest kindergarten students. He reports no
significant desegregation effects on the MAT, but significant gains for
kindergarten students. He reports none of the data, however. Of these
five studies, only Bowman is included in the panel's group of 19. The other
4 studies were rejected either because they used different tests for
pretest and posttest or because insufficient statistics were provided in
the write-up to permit us to compute an effect size. In my judgment none
of these 5 studies should be considered of especially good quality.

Conclusions 

It is stretching a point to argue that the twenty kindergarten-first 
grade studies are the "best" studies, given their wide range of quality. 
They were not selected as models of research, but because they gave what 
we thought were the least biased estimates of the effect of 
desegregation. We do believe that several of these studies are better 
than the average of the panel's selections, which were supposedly 
intended to be the "best," but we are not conducting a prize competition
for best dissertation* of the last two decades. We are trying to
estimate the effects of desegregation.

Our 20 "best" studies include 5 analyses of four different experimental
designs, all showing relatively large positive treatment effects (the
median treatment effect size of these experiments is .34 standard
deviations). We also found 8 "historical control groups" studies, six
of which showed a positive treatment effect and only 1 a negative
effect; the median effect size was .12 standard deviations. Finally, we
found 7 longitudinal studies, five of which showed positive treatment
effects and only one a negative effect, with a median effect size of
.24. Consistent positive outcomes on 5 analyses of randomized
experiments are impressive. While the other studies are a good deal
weaker methodologically, their results are also consistently
positive: 11 studies of 15 are positive and only 2 are negative. If the
principal function of selecting a superior subgroup of studies is to
find the consistency of results which is masked by error in an
unselected sample of studies, we believe we did that, and that the panel
did not.



*One of the 93 studies, a dissertation by Ann Linney (1979), did win a
prize from the American Psychological Association; it was not included
in either the panel's group of 19 or our list of 20.
School Desegregation as a Social Reform:
A Meta-Analysis of Its Effects on Black Academic Achievement

Norman Miller and Michael Carlson 
University of Southern California 

INTRODUCTION 

This paper addresses the specific question of what effect school 
desegregation has had on the achievement test scores of black children. 
It is one of a common set of papers addressing this issue, all prepared
for the National Institute of Education. All of the papers base their
conclusions and analyses on the same set of core studies that the panel
of experts, selected by NIE to perform the review task, have agreed upon
as meeting certain criteria for inclusion among those to be reviewed.

Before summarizing the results of these core studies, it is
important first to put the question itself into an historical context,
and second, to discuss the criteria for inclusion and exclusion of
studies and the procedures used in performing the analysis. Then, after
presenting the findings, their meaning and policy implications will be
discussed.

BACKGROUND 

School desegregation was initiated to address a social
inequity: the impairment of minority children's right to equal
educational opportunity. The Brown decision required school
desegregation as a remedy for prior discrimination, declaring separate
facilities inherently unequal. It is important to note that in the view
of Brown, educational outcome is not the issue. Had it been shown that
blacks in segregated schools performed on standardized tests as well as
did whites in segregated schools, inequality of educational opportunity
would nevertheless prevail according to Brown. This is not to deny that
the evidence of social scientists that was presented in the case did
focus on inequalities between black and white children in their
self-concepts, motivation, and academic performance. In its ruling,
however, the court seemed concerned primarily with the notion that
segregated schooling ineluctably stigmatized blacks as a social group.

"Does segregation of children in public schools solely on the basis 
of race, even though the physical facilities and other 'tangible'
factors may be equal, deprive the children of the minority group of
equal educational opportunities? We believe that it does... to
separate Negro school children from others of similar age and
qualifications solely because of their race generates a feeling of
inferiority as to their status in the community that may affect
their hearts and minds in a way unlikely ever to be undone... in the
field of public education the doctrine 'separate but equal' has no
place. Separate educational facilities are inherently unequal.
Segregation of white and colored children in public schools has a
detrimental effect upon the colored children. The impact is
greater when it has the sanction of the law; for the policy of
separating the races is usually interpreted as denoting the
inferiority of the Negro group" (Brown v. The Board of Education,
1954).

The fact of educational separation was the problem to be cured; the
cure was desegregation. In principle, this logic is simple and
straightforward; it requires no other major ingredients (such as, for
instance, proof that desegregation will eliminate or reduce wage
inequities, or other specific differences in the outcomes of blacks and
whites). Of course, when school desegregation was implemented in
specific cities and school districts, the method and degree of
desegregation became important issues. Presumably, in court-mandated
plans, the extensiveness of a court-imposed remedy should in some degree
correspond to the severity or magnitude of the acts that created
segregated schooling (Black, 1960; Kluger, 1977).

Americans are basically sympathetic to the plight of blacks. They know
that despite the beneficial social changes for blacks that have occurred
over the past decades, discrimination exists and most believe it wrong.
Most believe that the full weight of the Federal government should be
marshaled in order to eliminate such injustice. Two decades ago, 91
percent of whites favored equal voting rights, 87 percent favored the
right to a fair jury trial and non-segregated public transportation, and
72 percent favored integrated education. Despite the fact that white
Americans by a margin of 2 to 1 felt in 1966 that black children would
not be better educated in integrated classrooms, they had no deep
aversion to black children attending the same school as their own
offspring. By a margin greater than 3 to 1, they denied that the
education of white children would suffer if blacks are in their
classroom. Three out of four white Americans approved of the Court
ruling outlawing segregation in education (Brink & Harris, 1966, p.
131). There is, of course, substantial slippage between belief and
action. Despite this endorsement of the moral aspects of court rulings,
most whites may not be inclined to do anything specific about helping to
bring about integration in schools.

In viewing the courts' position, legal scholars have noted that the
remedy or restitution (viz., desegregation) was often imposed on parties
other than either the perpetrators of segregation (for instance, the
school board that created it) or their victims (those who graduated
from the segregated school system). This characteristic of legally
imposed remedies has led some legal analysts to interpret the underlying
legal principle or goal not as restitution to the injured party, but
instead, as group protection. Child labor laws or minimum age drinking
laws might be other instances of the same principle. For a discussion
of this view, see Yudof's (1980) interpretation and discussion of
Dworkin (1970).

Since the time of Brown, social science seems to have concerned
itself with the specific effects of desegregated schooling on black
academic achievement, black self-concepts, and on interracial hostility
and prejudice. Although these three issues were prominent in the social
science statement appended to Brown, they are not the same as racial
separation and stigmatization. Among the three, the one that most
closely approaches stigmatization in meaning, or is most directly
related to it, is intergroup hostility and prejudice. It should be
noted, however, that hostility and prejudice do not necessarily denote
stigmatization. Although ingroup bias is ubiquitous in intergroup
relations, not all or even most outgroups are stigmatized. We
frequently encounter outgroups in our daily lives. Common examples of
reciprocal ingroup-outgroup pairs might be: production and sales
personnel in a particular manufacturing company; two fraternities on a
university campus; two teams in a baseball little league; members of
opposing political parties; etc. Yet ordinarily, none of these groups
are stigmatized by each other.

The point here is that the issues that have concerned social
scientists, namely, low academic achievement and poor self-concepts
among black children, if not prejudice as well, are not the causes of
stigmatization. As implied by Campbell's argument, even if the
directions of existing difference were reversed, stigmatization would
persist (Campbell, 1967). The flexibility of our evaluative terminology
allows any direction of difference to be positively labeled when
describing ingroup members and negatively labeled when depicting
outgroups. ("We are firm; they are pigheaded.") Thus, to the extent
that racial-ethnic differences in academic achievement and self-concept
exist, it makes more sense to view them as consequences than as causes
of stigmatization. And if they are consequences, they certainly are not
the only ones. Other possible consequences are wage inequities,
inequalities in employment rates, lower voter turnout among blacks,
higher death and disease rates, etc.

SOCIAL SCIENCE RESEARCH ON SCHOOL DESEGREGATION 

In their research on school desegregation, why have social 
scientists focused their attention primarily on its effects on black 
academic achievement and black self-esteem? Perhaps in part they took 
their instruction from the emphasis found in the social science 
statement that was appended to the plaintiffs' case in Brown, which put
impairment of black children's self-concept as the most pivotal or
central consequence of black stigmatization, and viewed other
consequences as flowing from or being caused by this key deficiency
(Stephan, 1978).

The fact that studies of the effect of school desegregation on 
academic achievement, however, are so much more prevalent than those of 
any other variable reflects two additional factors. First, it
undoubtedly reflects the fact that measures of academic achievement are
so routinely administered by school districts. Second, such measures
are very readily seen as central to the educational mission. This makes
such studies more appealing to administrators who must approve the
researcher's intrusion into school activities and/or records, and also
to the public.

The courts, too, seem to have been responsive to this manifest
connection. Despite the fact that some research suggests that education
contributes relatively little to one's life outcomes (Jencks, Smith,
Bane, Cohen, Gintis, Heyns, & Michelson, 1972), the California State
Supreme Court (Crawford, 1975) viewed desegregated education as a means
of increasing the social mobility of minorities, presumably by providing
better education and higher levels of cognitive mastery to minority
students. Yet, Cook (1979), who was one of the authors of the social
science statement appended to Brown, states that it "nowhere predicted
improvement in the school achievement of black children as a consequence
of desegregation" (Cook, 1979). Nevertheless, it is clear that courts,
as well as social scientists, have been interested not merely in the
fact of segregated schooling, but also in the effects of desegregated
schooling on minority children.

Two problems have made it difficult for social scientists to
provide answers about the effect of school desegregation. The first is
the ambiguity in the meaning of the term "school desegregation." The
second stems from the quality and characteristics of the research
designs used to study it.

The definition of school desegregation. At first thought, the
meaning of the term "school desegregation" seems straightforward. An
analysis of how school desegregation has been implemented in any set of
communities or cities, however, reveals substantial variability. Thus,
the meaning of the term is in fact vague. The only common definitional
element among studies of its effects is that the ratio of minority and
white students in a classroom or school has been altered. By how much?
Are the whites in a classroom more or less numerous than the blacks? Is
the percentage of minority students in the class or school changed from
98 percent to 45 percent? Are the changes in percentages made in all
classes, or just at certain grade levels or programs within the school?
Are both groups of children shifted to new schools or is just one of the
groups? Is the teacher familiar to one or both groups of students or do
the students have a new and unfamiliar teacher? Do both groups retain
friends from the previous year in their class? To what extent have
important factors other than the ratio of white to minority
students also been altered (e.g., the curriculum, the student-teacher
ratio, the quality of physical facilities, the quality of teaching
materials, the quality of teachers, etc.)?

The problems created by an ambiguous definition can be illustrated 
by an analogy. Consider the question "Is eating food good for humans?" 
Although on first thought the answer is obviously "yes," we can quickly 
see that the answer will depend on what is eaten and how. If the
chicken salad has "turned," or the plate it is served on is
lead-contaminated, then the answer becomes "no." If a child is fed only
an ounce of food three times a day or the food is merely rubbed on the
child's stomach, it will starve. It might also starve if the only food
available were unpalatable (e.g., half-digested dog food taken from a
dog's stomach). A nutritionally balanced high-protein drink may sustain
life but also cause one's teeth to drop out. Extended hospitalization
for malnutrition might give one bed sores.

The examples above are not the "ordinary" instances of eating. But
what are the "ordinary" instances of school desegregation? There are
numerous circumstances in which few would expect desegregated schooling
to produce academic gains for blacks; e.g., when teachers, students, or
principals in receiving schools are prejudiced against blacks (the food
is poisoned); when there are only one or two of them in a classroom, or
when they are ignored in the classroom (too little food to provide
nourishment); when the curriculum is not modified to match their current
performance level, and consequently is not assimilated (food is rubbed
on their stomach); when they are made to feel rejected and incompetent
(the food is unpalatable). On the other hand, it may produce academic
gains but simultaneously, as a consequence of exposure to higher
performing classmates, lower their academic self-concepts (bed sores).

Americans may feel it is better or more moral to ship government
overstocks of potatoes to an undernourished third-world country than to
dump them in the ocean. As we have learned in the past, however,
shipping food to people is not the same as nourishing them. Potatoes
won't help if they arrive rotten, or if the receiving country lacks
adequate mechanisms for distributing them. Nor will they help if
protein deficiency is the problem. But nevertheless, despite our
failure to achieve the goal of nourishing a famine-plagued third-world
country, we might feel righteous about our efforts.

Simply put, many factors are relevant to school outcomes. Those
factors that go hand in hand with desegregation in one setting may not
in the next. Consequently, the meaning of the term varies from one
study to the next, and often in ways that are important but not well
documented.

Research designs in studies of school desegregation. As indicated,
a second problem in assessing the effects of school desegregation is
that researchers have rarely used a methodology that permits inferences
about what it was that caused some observable difference between
comparison groups (segregated and desegregated students). This issue is
quite separate from the previous one, which pointed to the variation in
the meaning of the term desegregation and covariation of other factors
with implementation of a change in the ratio of blacks to whites in a
school. It refers instead to the fact that children, classrooms, or
schools are almost never randomly assigned to comparison conditions. As
a result, one cannot know whether initial differences between the groups
account for (or cause) the differences found after the treatment
(desegregated schooling).

Experts are agreed that attempts to select out from (a) those
students who continue to have segregated schooling and (b) those
students who change to desegregated schooling, two subsets of children
that are matched (or on the average equal) on key variables do not
remove such initial differences. When the two subsets are later
retested on the variables on which they were originally matched, they
will again differ from each other in the direction in which the groups
initially differed. Similarly, they will also differ on variables
correlated with the variable on which they were matched. Consequently,
if, for instance, a high IQ implies better ability to learn, and if
prior to their desegregation the average IQ of the desegregated
students exceeded that of those who remained segregated, they might
well perform better after desegregation. Such a difference might just
as readily be attributed to the initial difference in IQ as to the
difference in type of schooling. Why might students with higher IQ's
naturally appear more frequently in the desegregated group? Parents
and children who are brighter may be more motivated to seek out better
schools. If they believe desegregated education to be superior, they
will push to be in that program, to be included sooner in the
desegregated group, or to be assigned to the desegregated school, etc.
(e.g., Gerard & Miller, 1975).

METHODOLOGICAL CONSIDERATIONS FOR SUMMARIZING THE 
NIE SET OF STUDIES 

PROCEDURES FOR COMBINING THE RESULTS OF STUDIES

Several different methods exist for summarizing the outcomes of a
group of studies. Recently these procedures have come to be called
meta-analysis (Glass, 1976). One procedure is simply to tally the
number of studies giving positive versus negative effects. This box
score or voting approach is crude because it fails, for instance, to
acknowledge differences among studies in the strength or magnitude of
difference between comparison conditions. Almost no experts now
advocate the voting method alone (Hunter, Schmidt, & Jackson, 1982).
Furthermore, the voting or box score method can lead to erroneous
conclusions due to "'false' conflicting results" in the literature
(Hunter et al., p. 132).

The Z-score method provides an alternative procedure for
representing the size of the relationship between the treatment variable
and the dependent measures in a given study. It requires computing the
exact p of the statistic employed by the original researcher (and
dividing it in half if a two-tailed test was employed) and then
converting each p value to an exact Z-score, based on the normal
probability distribution. The sum of these Z-scores across studies is
then divided by the square root of the number of findings included to
generate an overall Z-score and its associated probability level. This
provides an estimate of overall statistical significance, assessing the
likelihood that the results of the entire pool of studies reflect chance
outcomes. (This particular procedure typically understates significant
effects because many authors do not include specific t or other
test-statistic values in their research reports, and as a result,
nominal rather than exact p values have to be entered into the
analysis.) With this method, a fail-safe n can be calculated to
determine the number of additional studies with summed Z-scores that
total to zero which would be needed before the probability value
associated with the overall Z would exceed the .05 level.
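
The Z-score procedure and the fail-safe n can be illustrated with a
short Python sketch. It is offered only as an illustration under stated
assumptions: the function names are hypothetical, the p values in the
usage lines are invented, and the fail-safe calculation assumes a
one-tailed .05 criterion.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_to_z(p_value, two_tailed=False):
    """Convert a reported p value to a one-tailed Z-score.

    A two-tailed p is halved first, as the text describes.
    """
    p = p_value / 2 if two_tailed else p_value
    return STD_NORMAL.inv_cdf(1 - p)


def combined_z(z_scores):
    """Sum the Z-scores and divide by the square root of their count."""
    z = sum(z_scores) / sqrt(len(z_scores))
    return z, 1 - STD_NORMAL.cdf(z)  # overall Z and its one-tailed p


def fail_safe_n(z_scores, alpha=0.05):
    """Number of extra studies whose Z-scores sum to zero that would be
    needed before the overall one-tailed p exceeds alpha (assumed rule)."""
    total = sum(z_scores)
    if total <= 0:
        return 0.0  # the pooled result is not in the positive direction
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return max(0.0, (total / z_crit) ** 2 - len(z_scores))


# Invented two-tailed p values for four hypothetical findings.
zs = [p_to_z(p, two_tailed=True) for p in (0.04, 0.20, 0.01, 0.50)]
overall_z, overall_p = combined_z(zs)
print(round(overall_z, 2), round(overall_p, 4), round(fail_safe_n(zs), 1))
```

Because nominal p values (e.g., ".05") understate the exact Z, feeding
them to `p_to_z` gives a conservative overall Z, which is the
understatement noted above.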

The effect size method is the preferred method and the one used for
this paper. In this method, the difference between the means of pairs
of treatment conditions in each study is divided by the within-group
standard deviation of the outcome measure employed, thus yielding a
standardized mean difference score (Glass, 1977). These difference
scores can then be averaged across studies in order to generate an
overall effect size estimate.
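
A minimal Python sketch of the standardized mean difference follows. The
summary statistics are invented, the function names are hypothetical,
and the pooling rule for the within-group standard deviation is an
assumption made for illustration rather than a rule stated in the text.

```python
from math import sqrt
from statistics import mean


def pooled_sd(sd_a, n_a, sd_b, n_b):
    """Within-group SD pooled over two groups (assumed pooling rule)."""
    return sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))


def effect_size(mean_treat, mean_control, sd_within):
    """Standardized mean difference between two conditions."""
    return (mean_treat - mean_control) / sd_within


# Invented summary statistics for three hypothetical studies:
# (treatment mean, control mean, treatment SD, n, control SD, n)
studies = [
    (52.0, 50.0, 10.0, 40, 10.0, 40),
    (31.0, 30.0, 5.0, 25, 5.0, 30),
    (98.0, 101.0, 15.0, 60, 15.0, 55),
]
sizes = [effect_size(mt, mc, pooled_sd(st, nt, sc, nc))
         for mt, mc, st, nt, sc, nc in studies]
overall = mean(sizes)  # unweighted average across studies
print([round(d, 3) for d in sizes], round(overall, 3))
```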



EVALUATING THE STRENGTH OF RESEARCH DESIGNS 



Apart from generating summary estimates of overall effects,
meta-analysis procedures can in principle be utilized to assess whether
characteristics of research design and/or program implementation
features are related to program effectiveness. For this purpose,
characteristics of subjects, studies, and programs must be coded and
then entered as predictors in multiple regression analyses, with
estimates of size of effects as the dependent variable. Examples of
such predictor variables might be factors such as age of program
recipients, nature of the experimental design employed in the study, the
extent of parental involvement in the program, etc. In general, the
search for such predictor or moderator variables is highly prone to
capitalization on chance unless the number of studies is very large. In
the present case, many statistical experts might judge the number of
studies as too few to justify application of this procedure.

The study selection criteria imposed by the panel attempted to
eliminate particularly weak studies from consideration. This does not
mean that all, or even most, studies that survived the weeding out
imposed by application of the minimum procedures are strong studies.
They are not. And typically, studies with weak research designs show
stronger or more positive effects than do those with stronger designs.
For instance, in a meta-analysis of the larger body of school
desegregation research concerned with achievement test performance,
Krol (1978) found an average effect size of +0.21 among studies with
weak designs, whereas among those with stronger designs, the effect was
reduced by half (+0.10). While the effects of several design factors
(threats to validity) have been found to be negligible in some
educational contexts (Walberg, 1981), their influence nevertheless
should be assessed whenever meta-analyses are undertaken in any new
research arena. By imposing the selection criteria that we did,
however, most of the variation in strength of design found in the total
set of nineteen studies on school desegregation and academic
achievement has been eliminated.

As indicated above, in addition to analyses involving research
design considerations, it is ordinarily important to separate studies in
terms of variables associated with the strength of program
implementation. For this purpose, studies ideally should be rated or
classified on implementation variables independently of knowledge of
their outcomes. Unfortunately, the studies analyzed for this paper do
not provide much information on correlates of (or strength of) the
implementation of desegregation. Moreover, it is not even clear what
"strength of implementation" means with respect to school desegregation.

VARIATION IN NUMBER AND TYPE OF DEPENDENT MEASURE

In the subset of studies analyzed for this report, the specific
dependent measure varies from one study to the next. Not only do
studies use different measures of verbal achievement, but within the
same study the measure used prior to the implementation of desegregation
may differ from that used later. In addition, some studies also include
measures of achievement in mathematics, science, and other subjects, as
well as verbal achievement.

Does it make sense to try to summarize studies whose measures of
verbal achievement differ from one study to the next? It depends on the
situation or problem. Although, for instance, it may make perfect sense
to distinguish between vocabulary mastery and reading comprehension for
some studies of educational success, in the present case there is little
or no theoretical reason to expect school desegregation to differ in its
impact on the two. In other words, with respect to the issue of whether
school desegregation affects black academic achievement, different
measures of verbal performance are conceptually interchangeable in that
they all tap some aspect of the verbal component of the academic
curriculum.

For the same reason, the distinction between measures of verbal
achievement and measures of mathematical achievement (and/or other
academic areas such as science) can also be ignored, being merely
another instance of the same issue; again, there appears to be little
theoretical reason to think desegregation might affect the several
areas of mastery differently. This line of reasoning argues that a
single effect size be computed across studies regardless of variation
across studies in the particular dependent measure (e.g., vocabulary,
reading comprehension, mathematics, social studies, etc.).

In addition to variation among studies in their dependent measure,
many studies report outcomes for several dependent measures. In this
case, we are not dealing just with variation across studies in their
dependent measure, but with multiple outcomes on the same set of
children. Here, the ideal procedure would convert the two sets of
scores on each child (math and verbal achievement test score) to
standard scores which would then be averaged for each child. The effect
size for each study would then be computed on these averages. This
results in each study contributing one value to the meta-analysis and at
the same time minimizes error of measurement. Unfortunately, in the
present instance this cannot readily be done because the raw score
information is not available. To ignore the issue and treat the
separate outcomes in math and verbal performance obtained in a single
study as separate entries in the meta-analysis ignores the fact that
these outcomes are not independent. Although not perfectly ideal, the
best solution is to average the two effect sizes. This assures that
studies with more measures are not given greater weight than those with
few (or none).
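
The averaging rule can be sketched in Python as follows. The study names
and effect sizes are invented, and the function names are hypothetical;
the point is only that each study contributes a single value, however
many measures it reports.

```python
from statistics import mean


def study_effect(effects_by_measure):
    """Collapse one study's effect sizes into a single value."""
    return mean(effects_by_measure.values())


def overall_effect(studies):
    """Average the per-study values, one entry per study."""
    return mean(study_effect(m) for m in studies.values())


# Invented effect sizes for three hypothetical studies.
studies = {
    "Study A": {"verbal": 0.30, "math": 0.10},
    "Study B": {"verbal": 0.20},
    "Study C": {"verbal": -0.10, "math": 0.10, "science": 0.30},
}
per_study = {name: study_effect(m) for name, m in studies.items()}
print(per_study, overall_effect(studies))
```

Treating the six measures as six separate entries would instead give
Study C three times the influence of Study B.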

MULTIPLE SUBJECT GROUPS 

The same logic applies to the analysis of subgroups or multiple
groups within the same study. The ideal procedure is to use an overall
test across all subgroups. If this is not provided by the individual
researcher, then the best alternative is to average the effect sizes
computed for each subgroup.

CRITERIA FOR INCLUSION



Appendix A lists the criteria agreed upon by the NIE panel as a basis
for inclusion of studies to be analyzed. These yielded a core sample of
19 studies. Only studies included in the NIE core sample were
considered appropriate for meta-analysis. This requirement provides the
first entry in Table 1, which details additional inclusion criteria for
the present study. Given this set of core studies, a further criterion
is that the proportion of blacks in the segregated control group must
exceed 50%. This provision serves conceptually to tighten the notion of
"segregation," and ensures that the proportion of control group
non-blacks in some studies will not approach the experimental group
non-black proportions which are represented in others. The studies by
Carrigan (1969) and Thompson & Smidchens (1979) were excluded from the
analysis by this criterion.

The second part of Table 1 provides the guidelines for including
the various segregated-desegregated comparisons which are contained
within the 17 selected studies. The first restriction is that the Ns
for both segregated and desegregated pre- and post-tests must be at
least 10. This sets at least a moderate lower bound on the reliability
of the estimates of sample means and standard deviations, as the
precision of such estimates increases with sample size. Very small
samples occasionally yield standard deviations which are only a
fraction of the population value, and thereby are capable of producing
highly misleading effect size estimates. A second inclusionary
restriction on the particular comparisons concerns segregated control
groups exposed to "enriched" or other novel types of curricula. Such
control groups are not used because the resultant effect size estimates
inversely reflect the efficacy of the particular special treatment
employed in the "control" group. Such a situation fails to produce an
acceptable test of the effects of desegregation on black achievement.

As indicated earlier, standardized achievement and ability tests of
specialized content areas (e.g., social studies, science), as well as
verbal and mathematical achievement, were included in the analysis. IQ
comparisons were eliminated on the grounds that, in theory, a student's
level of intelligence should not be especially sensitive to classroom
experiences. Additionally, tests of "work study skills" were excluded
because they do not correspond to any major academic content area. A
further restriction noted in Table 1 is that the pretest and posttest
had to measure an identical construct (e.g., "vocabulary," "arithmetic
concepts"). Usually, this meant use of the same standardized tests
(e.g., Iowa, Stanford, etc., corresponding to the appropriate grade
levels) for both the pretest and the posttest. However, cases in which
the pretest and posttest differed, but nonetheless assessed the same
construct, were also included, with the pretest means being adjusted to
correspond to the posttest scale.

As noted in a preceding section, in studies of school desegregation,
researchers are rarely able to assign children randomly to experimental
and control conditions. The selection effects that occur sometimes
result in higher test score means and larger standard deviations in
experimental than in control groups prior to the onset of desegregated
schooling. Therefore, it is important to attempt to correct
post-measured differences so that they do not simply reflect the initial
inequivalence of the comparison groups, but instead reflect the effect
of desegregated schooling.

Table 1

A. Criteria for inclusion of studies:

1. Study must be included in NIE core list.

2. Segregated control group must be over 50% black.

B. Criteria for inclusion of comparisons within studies:

1. Ns must be larger than 10 for both segregated and
desegregated conditions.

2. Segregated control group must not receive any special
treatments which extend beyond the typical classroom experience
(e.g., "enriched" control classes are excluded).

3. Dependent variable must consist of a verbal, math, or
"other" (e.g., science, social studies) achievement or ability
test which corresponds to a major content area (excluded are IQ
tests and "work study skills" tests).

4. Pretests and posttests must measure an identical
construct.

5. Either:

a. Posttest standard deviations (or reliable estimates from
national norms or a comparable study), along with pretest to
posttest mean differences for segregated and for desegregated
conditions, must be present; or

b. An ANCOVA table (with pretest differences as a
covariate) which reports a t or an F value for segregated vs.
desegregated posttest score differences must be present.



In order to arrive at pretest-adjusted estimates of effect size, it
is necessary to possess the following information: (1) an estimate of
differential experimental vs. control group pretest/posttest gain
scores; and (2) an estimate of the population standard deviation. Thus,
the final criterion for inclusion listed in Table 1 is the presence of
these two pieces of information. These numbers typically were furnished
in the form of tables containing pretest and posttest means and standard
deviations for both segregated and desegregated groups. Analysis of
covariance summary tables (with pretest differences as a covariate)
provided an acceptable alternative source of such information. Finally,
in the absence of the above sources of information, a comparison could
still be included if the pretest and posttest means were reported and if
the standard deviation could be estimated from either national norms or
from a comparable study using the same test for the same grade level.

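COMPUTATION OF EFFECT SIZE

The calculation of effect size estimates for the included comparisons
was achieved via the following formula:

\[
ES = \frac{\bar{X}_{E(\text{post})} - \bar{X}_{C(\text{post})}}{S_{E+C(\text{post})}}
   - \frac{\bar{X}_{E(\text{pre})} - \bar{X}_{C(\text{pre})}}{S_{E+C(\text{pre})}}
\qquad (1)
\]

E = Experimental (Desegregated) Group
C = Control (Non-Desegregated) Group
S = Pooled (E and C) standard deviation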

Effect size is defined here as the posttest desegregated vs. segregated
difference in means (as expressed in pooled posttest standard units)
minus the pretest desegregated vs. segregated difference in means (as
expressed in pooled pretest standard units). For the estimation of
population pretest and posttest standard deviations, a pooled figure is
used (in preference to Glass' recommendation of using only the control
group standard deviation) in order to increase the reliability of such
estimates. The soundness of using a population estimate based on a
pooled figure lies in the fact that preliminary tests indicated that
among the NIE core studies, no overall significant difference was
present between the standard deviations of the desegregated and
segregated groups, at either the time of the pretest or the posttest.

Fan-spread. It is important to note that the present effect size
estimation procedure eliminates any interpretative problems stemming
from the "fan-spread hypothesis." According to the fan-spread notion,
a widening of the difference between group means over time will be
accompanied by an increase in the within-group standard deviations.
This implies that the difference between two group means may grow over
time in the absence of any increment in the correlation between the
treatment and the dependent variable (Kenny, 1975). The effect size
formula used in this study, by separately standardizing the difference
between means at times T1 and T2, permits a determination of the
extent to which desegregation is associated with improvement in academic
achievement over and above mere fan-spreading. The computational
procedure is identical to that used by Armor (1983) for those cases in
which he judges fan-spread to be present. In other cases, however, a
difference arises, in that Armor pools the four estimates of standard
deviation in instances in which he judges that fan-spread does not
exist.



Armor's procedure contains two problems. First, fan-spread is a
matter of degree. What criteria should be used to make a dichotomous
judgment of "present" or "absent," and how can such a dichotomous
decision be justified? A statistical test of whether standard
deviations differ in a particular instance is not a satisfactory
criterion, in that it sensibly could be argued that correction should
also be made when differences fall just short, or somewhat short, etc.,
of statistical significance.



A second problem is that Armor's procedure may systematically place
undue weight on pretest differences. If it is assumed that fan-spread
effects do not occur (or do not occur all of the time), and further,
that the distribution of pretest vs. posttest standard deviation
differences is associated with a certain degree of sampling variance
(which is particularly likely here due to small sample sizes), then
sampling error alone will produce a set of instances in which the
pretest standard deviation is below the posttest standard deviation.
This suggests that Armor's procedure may be susceptible to a bias in
which only pretest standard deviations that happen to be low will be
used to specifically scale pretest mean differences, while those that
are higher (relative to the posttest standard deviation) will be
averaged in with the posttest estimates. The net result is that pretest
differences may be given a disproportionately high weighting across
cases. Because the desegregated group usually shows a higher pretest
mean than the segregated control group, Armor's procedure consequently
can be expected to produce a lower overall estimate of effect size than
the formula that we use.


In order to assess the extent to which a consideration of
fan-spreading, however, is important in accounting for the results of
the current sample of desegregation studies, effect size estimates were
also calculated by using an alternative formula:

\[
ES = \frac{\left(\bar{X}_{E(\text{post})} - \bar{X}_{E(\text{pre})}\right)
         - \left(\bar{X}_{C(\text{post})} - \bar{X}_{C(\text{pre})}\right)}
        {S_{E+C(\text{post})}}
\qquad (2)
\]

E = Experimental (Desegregated) Group
C = Control (Non-Desegregated) Group
In this formula, the desegregation vs. segregation pre-post gain
score difference is divided by an estimate of standard deviation that is
based on the pooled posttest figures. If the pretest standard
deviations tend to be low relative to those of the posttest, and if the
desegregation group tends to possess a higher mean than the control
group at the time of the pretest (as is the case when the fan-spread
hypothesis holds), then this formula should produce larger estimates of
effect size than should the first formula. This is true because, under
the first formula, the typical pretest advantage for the desegregated
students, which is subtracted from the standardized posttest
difference, is divided by the smaller pretest standard deviation and is
therefore weighted more heavily in determining effect size estimates.
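
The contrast between the two formulas can be illustrated with a short
Python sketch. The means and standard deviations below are invented to
mimic a fan-spread case (a pretest advantage for the desegregated group
and a pretest standard deviation below the posttest one); the function
names are hypothetical, and the pooled standard deviations are supplied
directly rather than computed.

```python
def effect_size_formula_1(e_pre, e_post, c_pre, c_post, sd_pre, sd_post):
    """Standardize the post and pre differences separately, then subtract."""
    return (e_post - c_post) / sd_post - (e_pre - c_pre) / sd_pre


def effect_size_formula_2(e_pre, e_post, c_pre, c_post, sd_post):
    """Divide the difference in pre-post gains by the pooled posttest SD."""
    return ((e_post - e_pre) - (c_post - c_pre)) / sd_post


# Invented group means: E = desegregated, C = segregated control.
means = dict(e_pre=52.0, e_post=60.0, c_pre=50.0, c_post=55.0)
es1 = effect_size_formula_1(**means, sd_pre=8.0, sd_post=10.0)
es2 = effect_size_formula_2(**means, sd_post=10.0)
print(round(es1, 3), round(es2, 3))
```

With these figures the second formula gives the larger value, because
the 2-point pretest advantage is divided by 10 rather than by 8 before
being subtracted. When the pretest and posttest standard deviations are
equal, the two formulas coincide.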

Effect size estimates based on analysis of covariance. For cases
that only reported an ANCOVA (Analysis of Covariance) summary table, in
which pretest scores served as the covariate, the following
transformation procedure was used to estimate the effect size:

\[
ES = \frac{2t}{\sqrt{N}} \, (.633)
\]

where N is the combined sample size. Multiplying by .633 serves to
correct for the fact that the variance of change scores tends to be
lower than the variance of raw sample scores
($S^2_{\text{change}} = 2S^2(1 - r)$, as reported by Armor), with the
difference being greatest for cases involving high pretest-posttest
reliabilities. For the present purposes, a fairly high reliability
estimate (r = .8) was assumed, which algebraically leads to the
modification of effect size noted above.
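
The size of the multiplier can be checked with a one-line calculation.
The sketch below assumes that the relation attributed to Armor reads
"variance of change scores = 2 x variance of raw scores x (1 - r)" and
that pretest and posttest variances are equal; the function name is
hypothetical.

```python
from math import sqrt


def change_score_sd_ratio(r):
    """SD of change scores divided by SD of raw scores, given the
    pretest-posttest correlation r (equal variances assumed)."""
    return sqrt(2 * (1 - r))


print(round(change_score_sd_ratio(0.8), 3))
```

For r = .8 the ratio is roughly .63, in line with the multiplier quoted
in the text.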

Sample size. Some experts (e.g., Hunter et al.) argue that a summary
statistic of the effect sizes computed for the sample of studies (viz.,
mean effect size) should be weighted by the sample size of each study.
Though there often may be good reasons to adopt this procedure,
especially when summarizing experimental studies, for several reasons
it will not be used here. In experimental research, the manipulations
are designed to correspond to a theoretical variable. Researchers
almost routinely use manipulation checks to assess whether or not the
independent variable theoretically postulated to affect the dependent
measure has in fact been manipulated by the experimental operations that
were employed, and if so, to assess whether it was manipulated "strongly
enough." If, in a particular study, the manipulation check failed to
confirm appropriate variation of the independent variable, and in
addition, there were no treatment effects, no sensible scientist would
want to include the study in the meta-analysis.


By contrast, as argued above, it is not clear what, if any,
theoretical variable corresponds to or is conceptually linked to a
change in the ratio of black and white children in a classroom (or
school) and consequently might be responsible for black achievement
gains. Indeed, as indicated later in this paper, research seriously
impugns any positive role for the one theoretical process postulated in
the past to cause academic gains for minority students. Not knowing
what underlying theoretical variable is relevant to academic gains for
blacks, it makes perfect sense that such manipulation checks simply are
not found in desegregation research. Consequently, one cannot know
whether or not, in any particular study, the desegregated groups were
exposed to the "key ingredients." If a study with a very large sample
fails to contain these ingredients (or contains other features which
produce losses in black achievement), and if this study outcome were
weighted by its sample size, it might more than counterbalance the
effects of other studies which, with smaller samples, produced positive
effects. (In this regard, it is noteworthy that sample sizes among
studies in the NIE core set vary by a margin of fifty to one.) Stating
this another way, extraneous factors related to sample size, which may
or may not be causal, may be correlated with effect size.

Anticipating the results, analyses show that: (1) sample size is
indeed negatively correlated with effect size (r = -.404) and (2) the
observed variation among effect sizes exceeds that to be expected from
sampling error, suggesting that moderator variables are in fact
operating. Taken together, these considerations argue strongly for the
decision to weight study outcomes equally, rather than by sample size.
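
The concern about weighting by sample size can be made concrete with a
small Python sketch. The effect sizes and sample sizes are invented
(with a fifty-to-one spread in N), and the function names are
hypothetical.

```python
def unweighted_mean(effects):
    """Each study counts once, regardless of its sample size."""
    return sum(effects) / len(effects)


def n_weighted_mean(effects, sample_sizes):
    """Each study counts in proportion to its sample size."""
    total_n = sum(sample_sizes)
    return sum(d * n for d, n in zip(effects, sample_sizes)) / total_n


# Two small studies with positive effects, one very large study
# with a slightly negative effect (all values invented).
effects = [0.40, 0.30, -0.10]
sample_sizes = [50, 60, 2500]
print(round(unweighted_mean(effects), 3),
      round(n_weighted_mean(effects, sample_sizes), 3))
```

Here the equally weighted mean is positive, while the N-weighted mean is
negative because the single large study dominates the result.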

Correction for unreliability. In the current analysis, each effect
size estimate was corrected for unreliability (following the procedures
of Hunter et al., 1982). Measurement unreliability has the effect of
artificially inflating the variability of scores, thereby leading to
larger standard deviations and, hence, lower absolute values of effect
size estimates. The unreliability correction procedure advanced by
Hunter et al. divides the estimated effect size value by the square
root of the reliability coefficient of the dependent measure. In some
of the cases comprising the NIE core studies, reliability coefficients
were either reported directly or were readily available from national
norms. For the remainder, a conservatively high reliability estimate of
.95 was automatically assumed for each test. The net result of
correcting for unreliability was to increase the absolute value of the
particular effect size estimate by about 1.3% to 3%.
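
The correction itself is a single division, sketched below in Python.
The function name is hypothetical, and the effect size and reliability
used in the usage line are illustrative values, not figures from any
particular study.

```python
from math import sqrt


def correct_for_unreliability(effect_size, reliability):
    """Divide an effect size by the square root of the reliability
    coefficient of the dependent measure."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must lie in (0, 1]")
    return effect_size / sqrt(reliability)


raw = 0.15
corrected = correct_for_unreliability(raw, 0.95)
percent_increase = 100 * (corrected / raw - 1)
print(round(corrected, 4), round(percent_increase, 1))
```

With a reliability of .95 the estimate grows by a little under 3%,
which falls inside the range of increases reported above.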

RESULTS 

The results of the meta-analysis are summarized in Table 2. For
each study, a mean was calculated (when possible) for each of the three
types of dependent variable categories (i.e., verbal, math, and
"other"). Next to each mean, in parentheses, is the number of different
tests that were averaged in arriving at the figure.



Using formula (1), the overall effect size is +.159 (see bottom of
column 1, Table 2). This estimate weights results within each study
equally and weights each study equally. The fact that formula (2) gives
an outcome of +.155, which is essentially equivalent to that obtained
with formula (1), confirms the view, presented earlier, that fan-spread
is not a problem in these data.

For purposes of comparison, the effect size computations of Armor
(1983), Stephan (1983), and Wortman (1983) are reported in the adjacent
columns of Table 2 (columns 3, 4, and 5). Table 3 summarizes the
findings of all four researchers, reporting their mean effect sizes,
separately for verbal and math tests, for each study. Pooling the
outcomes across researchers and studies, the effect size of +.156 for
verbal tests is significant (t = 2.26, p < .05), as is the pooled verbal
and math effect size of +.119 (t = 2.40, p < .05). The effect of
desegregation on mathematics tests is smaller than that found on verbal
tests (though not significantly so) and, when tested separately, does
not yield a significant effect size (see columns 1 and 2, and see
Table 3).

Sources of Disparity in the Effect Size Estimates for Individual Studies 

Comparisons of our own effect size computations with those of Armor,
Stephan, and Wortman for each study reveal that they agree fairly well;
the correlations, using estimates based on formula (1), are +.87, +.76,
and +.74 with Armor, Stephan, and Wortman, respectively.

The correlations were computed by treating the mean verbal effect
size per study and the mean math effect size per study as separate
entries. The fact that the verbal and math effect size estimates are
not based on independent samples is irrelevant for this computation in
that it seeks to assess the comparability of effect size computations
performed by independent investigators. There is little reason to think
that computations performed within a study are less independent than
those between studies. Despite the high correlation between estimates,
the fact that these correlations are less than perfect, as well as the
fact that inspection of effect sizes across the rows of Table 2
reveals variation, makes it clear that computational differences exist.

The following paragraphs, on a case by case basis, examine all
instances in which our estimates differed from the mean estimate of
Armor, Stephan, and Wortman by more than .1 of a standard deviation.

Anderson (Math)

Our estimate is slightly higher (+.669) than those of Armor (+.54) and
Wortman (+.53), mainly as a result of a discrepancy between the mean of
the raw pretest segregated math scores contained in Table 26 (45.093,
p. 138) and the mean Anderson presents in his pretest summary table
(43.82, p. 144). We used the mean of the raw scores, which led to a
higher effect size estimate due to the inclusion of a larger segregated
group pretest figure.

Beker (Verbal)



The major reason for our higher estimate seems to be our inclusion
of a wider array of tests (spelling, word meaning, language, and
vocabulary) which demonstrated larger positive effects than did
paragraph meaning. Wortman's estimate is additionally lower due to his
exclusive use of the "refused transfer" controls instead of the
"requested transfer" group.

Klein (Math) 

Our estimate for math agrees with that of Stephan (+.33), but is
substantially higher than Armor's (-.08). The reason for the
discrepancy is that we used only the "random" control group, while Armor
used only the "matched" control group. The matched controls were
excluded from the present analysis because the corresponding ANCOVA
summary table mixes the data for the segregated and desegregated blacks
along with that of the white students.

Rentsch (Verbal) 

Our verbal effect size estimate, though quite close to Stephan's, is
lower than that of Wortman. This is primarily due to Wortman's use of
the "abnormally low" pretest standard deviations (see in particular the
control group). His use of Glass' formulas creates this outcome. Our
own formula #2 outcome, which lacks sensitivity to temporal changes in
standard deviations, yields, as expected, a result much closer to
Wortman's.

Savage (Verbal) 

Our estimate for verbal achievement (-.08) is both lower than and
in the opposite direction of the mean of the estimates of Armor,
Wortman, and Stephan (+.117). The sole reason for this appears to be
our inclusion of STEP Writing (+.048) and STEP Listening (-.437) in
arriving at a verbal effect size estimate. Our figure for Reading
(+.150) agrees perfectly with Armor's estimate and differs from
Wortman's by only .01.

Slone (Verbal) 

Our estimate of .091 is somewhat lower than that of both Armor
(+.27) and Stephan (+.19). This is because, in addition to Reading
(+.242, which is fairly close to the other estimates), we included the
Language Skills test (-.061).

Syracuse (Verbal)

Our figure for the Syracuse report (+.691), while relatively close
to Stephan's estimate (+.75), is much higher than Armor's (+.375). The
reason is that Armor includes a second comparison (which we excluded
because of missing standard deviations) in which the effect size was
essentially zero.

Van Every (Verbal and Math)



Our estimate for verbal achievement (-.166) is somewhat less negative
than the estimates of Armor (-.46) and of Wortman (-.44). This is
because they only consider Reading (which we estimated to be -.468),
whereas we additionally included Language Arts (+.137).

Our math estimate is nearly identical to those of Armor and
Wortman, and differs significantly only from Stephan's figure.
Stephan's lower estimate most likely stems from his use of Glassian
formulas, in conjunction with his correction procedure for the amount of
time elapsing between the pretest and the posttest.

Walberg (General Note) 

Due to problems in the legibility of our copy of this report, we
were unable to calculate a verbal effect size estimate for the 10-12th
grade groups as well as any estimates for math achievement.

Sources of Disparity in Overall Effect Size Estimates 

Among the three NIE panel members' computed effect size estimates,
Armor's overall effect size estimate of +.077 is most discrepant from
our own. Consequently, his computations were chosen as a basis for
estimating sources of discrepancy.

Table 4 presents an analysis of the disparity. It shows that correction
for unreliability in the dependent measures is not a major contributor
to our higher estimate. In part, this is due to the fact that
conservatively high reliability estimates (viz., .95) were assumed for
the studies for which no reliability was reported. Reliability estimates
provided by test publishers do not report separate reliability estimates
for blacks, but were they available, they are likely to be lower than
those reported for whites. In sum, a less conservative and more
realistic correction for unreliability would yield a larger, more
positive overall effect size estimate.

The factor responsible for the largest portion of the difference 
(approximately 45%) was our inclusion of results on achievement tests on 
content other than verbal skills and mathematics. It is worth noting 
that although only three studies report such results, the mean effect 
size (and its standard deviation) is substantially larger than that of 
effect sizes based on verbal and mathematics tests. 

Moderator Variables V ' 

Ordinarily, with such a small set of studies, it is hard to justify
a search for variables that explain the relation between the independent
(school desegregation) and dependent (academic achievement) variables.
A simple set of computations, however, can suggest whether such a search
will be fruitful. The variance of the effect sizes over the sample
studies can be computed and corrected for sampling error. If the effect
sizes are really identical and vary only because of sampling error
(i.e., they are simply random deviations from the true mean value), then
the "true variance" of the effect sizes would be zero. Hunter et al.
provide formulas for computing the variance of an array of effect sizes,
corrected for sampling error. When sampling variability ($\sigma^2_{error}$)

Table 4

Analysis of Discrepancy Between Effect Size
Estimates of Armor and Miller and Carlson (a)

Source                                                   Contribution

Inclusion of reliability correction                        + .005
Inclusion of Rentsch                                       - .008
Inclusion of "other" category data                         + .0358
Averaging in of extra tests excluded by Armor              + .002
Calculational differences on same non-Ancova cases         + .006
Calculational differences on cases where we
  estimated from Ancova                                    - .006
Different comparison groups used in same study (Klein)     + .0172
Armor's inclusion of [name illegible] study                + .005
Cases within studies included only by Armor                + .022

Total                                                      = + .079

(Miller and Carlson + .159) - (Armor + .077)               = + .082

Unaccounted difference                                     = + .003

Notes

(a) Table entries are based on overall means of Miller and Carlson's
Verbal, Math, and "Other" tests.

is removed from the computed variance among obtained effect sizes
($\sigma^2_{ES}$), there should be no residual
($\sigma^2_{ES} - \sigma^2_{error} = 0$) if, in fact, the effect size is
really the same across studies. If, on the other hand, the residual
variation is large, especially if large in comparison to the mean value,
a search for moderator variables should be made.

In the present case, our effect sizes for verbal achievement tests were
used to assess this issue. When sampling variability is removed, the
residual variance does not approximate zero
($\sigma^2_{ES} = .038$; $\sigma^2_{error} = .012$).

These results show that 68% of the variance in the computed effect size
scores (weighted by sample size) is unexplained by sampling error:

$$
\text{Proportion of variance unexplainable on the basis of sampling error}
= \frac{\text{Variance}_{ES} - \text{Variance}_{error}}{\text{Variance}_{ES}}
= \frac{.026}{.038}
$$

These results argue strongly that variation among study characteristics,
and not mere sampling fluctuation, is responsible for the observed
variation in the computed effect sizes.
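
The arithmetic behind the 68% figure can be checked with a minimal Python sketch. The function names are illustrative assumptions, not part of the original analysis; only the two variances (.038 and .012) come from the text.

```python
def residual_variance(var_observed: float, var_sampling_error: float) -> float:
    """Variance left after subtracting the sampling-error variance."""
    return var_observed - var_sampling_error


def proportion_unexplained(var_observed: float, var_sampling_error: float) -> float:
    """Share of the observed variance not attributable to sampling error."""
    return residual_variance(var_observed, var_sampling_error) / var_observed


# Values reported in the text for the verbal effect sizes.
var_es = 0.038      # variance among obtained effect sizes
var_error = 0.012   # variance expected from sampling error alone

residual = residual_variance(var_es, var_error)
share = proportion_unexplained(var_es, var_error)

print(f"residual variance      = {residual:.3f}")
print(f"proportion unexplained = {share:.2f}")
```

A residual well above zero, as here, is the signal the text uses to justify a search for moderator variables.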

Given these results, three potential moderator variables were examined:
year of study, region (North vs. South), and percentage of black
students in the desegregated class. Prior to computing the correlation
between effect size and each potential moderator variable, we averaged
our own effect size estimates with those of Armor, Stephan, and Wortman,
separately for verbal and math achievement. Pooling gives a more stable
estimate. Although earlier in the chapter we argued that the different
content domains of academic performance should be considered indices of
a common underlying construct, separate treatment of verbal and math
effects is justified by the low correlation between these two effect
size estimates within each study (r = +.29; $r^2$ = +.084; df = 12; p > .05),
and the fact that Stephan provides a theoretical rationale for different
outcomes on verbal and math tests. When the verbal and math effect
sizes of Armor, Stephan, and Wortman are pooled with our own, the
correlation between them is even smaller (r = +.15; $r^2$ = +.023; df = 12;
p > .05).

Interestingly, both verbal and math effect size estimates correlate
negatively with year of study ($r_v$ = -.554 and $r_m$ = -.559, p < .05,
respectively). Region is unassociated with effect size (point biserial;
$r_v$ = +.104; $r_m$ = +.104; north positive, p > .05).
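
As a rough check on these significance statements, a correlation can be converted to a t statistic with t = r * sqrt(df / (1 - r^2)). The sketch below is illustrative only: it assumes df = 12 applies to the moderator correlations (the text states df = 12 only for the verbal-math correlation), and it hard-codes the standard two-tailed .05 critical value of t for 12 degrees of freedom.

```python
import math


def t_from_r(r: float, df: int) -> float:
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(df / (1.0 - r * r))


# Two-tailed .05 critical value of t for df = 12, from a standard t table.
T_CRIT_DF12 = 2.179

# Correlations as read from the text; df = 12 is an assumption here.
cases = [
    ("verbal ES vs. year of study", -0.554),
    ("math ES vs. year of study", -0.559),
    ("verbal ES vs. region", 0.104),
]

for label, r in cases:
    t = t_from_r(r, 12)
    verdict = "significant" if abs(t) > T_CRIT_DF12 else "not significant"
    print(f"{label}: r = {r:+.3f}, t = {t:+.2f} ({verdict} at .05, two-tailed)")
```

Under that assumption, the two year-of-study correlations clear the .05 threshold and the region correlation does not, which agrees with the pattern reported above.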

There is some suggestion, however, that percentage of blacks in the
classroom is important and that it has different effects on verbal and
math achievement. The correlation between percentage of black students
in the class and verbal effect size is -.281. In contrast, no such
effect is found for math achievement; in fact, the correlation between
percentage black and math achievement, though not significant, is
opposite in sign (+.310). When year of study is partialled out, the
above correlations for verbal and math are equal to -.339 and +.422,
respectively; the difference between them is significant (p < .05,
one-tailed).
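
"Partialling out" year of study refers to a first-order partial correlation. The sketch below shows the standard formula. It is not a reproduction of the reported values, because the correlation between percentage black and year of study is not given in the text; the value used for it is a hypothetical placeholder.

```python
import math


def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of x and y with z held constant (first-order partial)."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator


r_es_pct = -0.281    # verbal effect size vs. percentage black (from the text)
r_es_year = -0.554   # verbal effect size vs. year of study (from the text)
r_pct_year = 0.10    # HYPOTHETICAL: percentage black vs. year, not reported

adjusted = partial_corr(r_es_pct, r_es_year, r_pct_year)
print(f"partial r (year held constant) = {adjusted:+.3f}")
```

When the control variable is uncorrelated with both of the others, the partial correlation reduces to the simple correlation, which is a quick sanity check on the function.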

These results provide some support for Stephan's (1983) interpretation
of his own computed effect size differences for verbal and math
achievement, showing desegregation to produce essentially no benefit for
the latter. He interprets the gain in black verbal achievement that is
found with desegregated schooling to be a consequence of increased
exposure to white speech style, syntax, grammar, etc. If this
interpretation has merit, it makes sense that percentage of blacks in
the classroom should be inversely related to such gains. The fewer the
number of other blacks in the classroom, the more likely it is that the
desegregated black child must interact with white children and the less
likely it is that he or she would find a within-race peer support group
in which black speech is practiced and reinforced.

Correction of Effect Size Estimates for "Overall School Improvement" 

The analyses presented above examine the achievement gains of
desegregated black children but ignore changes among their white
classmates. It is important to examine the latter, however, because
when both groups gain (or lose), it suggests that it is not
desegregation per se that is responsible for the effect, but instead
some other factor that has affected the school or school district as a
whole, thereby improving the academic performance of all of its
students. Such factors might be: influx of new funding; improved
curriculum materials; a new principal; renewed teacher enthusiasm;
increased emphasis on preparation for state-mandated testing; or
whatever.

Those sympathetic to the idea of desegregation might contend that when
school changes such as those cited above appear hand in hand with
desegregation, they should not be viewed as confounding effects, that
is, as factors other than desegregated schooling that explain the
observed minority gains. Instead, they should be thought of as natural
covariates of desegregation, that is, as part of the meaning of the
term. In other words, according to this line of thought, whenever one
desegregates a school or school district, these simultaneous changes
(whatever they are, and however unspecified they must remain) can be
expected to co-occur with the change in the ratio of black and white
students. And as long as they regularly or naturally co-occur with
desegregation, their academic benefits to minority children can be
attributed to desegregation. In this view, if whites gain along with
blacks, all the better.

There are two problems with this line of thought. One lies in the
validity of the assumption that these school changes can be expected to
co-occur routinely with desegregation in the future (or in other
unsampled districts). For instance, today, in an era of minimal
availability of increased state and federal funding for schools, some of
these mediating factors (e.g., new or improved curriculum and/or text
materials, or lower pupil-teacher ratios) may no longer be readily
available to desegregating districts. Similarly, 15 years ago teachers
and principals may well have been more inclined to expect positive
outcomes as a consequence of desegregation than they are today. Such
expectancies have often been found to be self-fulfilling for one reason
or another. If present then, but not today, outcomes would again differ
depending on whether one included or excluded such factors in one's
definition and implementation of desegregation. The strong negative
correlations reported above between year of study and positivity of both
verbal and math effect size estimates argue strongly that one cannot
rely routinely on the natural occurrence of these beneficial
ingredients.



A second problem lies in one's definition of academic benefit. Some
scholars argue that benefit should be defined in an absolute sense. If
desegregation produces academic gains for blacks, and does not produce
losses for whites, it is beneficial. In this view, it does not matter
if the gains of white children equal or exceed those of blacks. An
alternate view focuses instead on the closing of the academic
achievement gap. Consequently, it defines desegregation as beneficial
only if the gains of black children exceed those of whites.

Three studies in the NIE core set, Beker (1967), Clark (1971), and Laird
and Weeks (1966), provide data that permit analysis of the effects of
desegregation on white as well as black children. All seven available
cases of the mean verbal, math, or "other" test effect size per study
can be compared by using the following formula:



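$$
\left[\frac{\bar{X}_{post} - \bar{X}_{pre}}{SD_{\text{pooled pre + post}}}\right]_{\text{desegregated blacks}}
-
\left[\frac{\bar{X}_{post} - \bar{X}_{pre}}{SD_{\text{pooled pre + post}}}\right]_{\text{receiving school whites}}
$$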



The resulting difference in effect sizes is -.379 (N = 7, p > .05,
S.D. = .894). Although not significant with only seven cases, the
direction of effect shows that the gains of white children in the
receiving schools of these studies substantially exceeded those of black
children, which were roughly of the same positive magnitude as the gains
found for the entire sample of blacks. That is, the mean effect size
for blacks in these three studies (weighting tests equally) was +.15
(compared to the entire sample effect size of +.159), whereas the effect
size for whites was +.52. In other words, the achievement gains of
white children in these three studies were more than three times as
large in standard units as those of their black classmates.
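
The comparison can be illustrated with a short Python sketch. The helper function and the pre/post numbers passed to it are hypothetical; only the two group means (+.15 and +.52) come from the text. The reported -.379 is presumably the mean of the seven case-level differences, so it need not equal the simple difference of the two group means computed here.

```python
def gain_effect_size(pre_mean: float, post_mean: float, pooled_sd: float) -> float:
    """Pre-to-post gain expressed in pooled standard deviation units."""
    return (post_mean - pre_mean) / pooled_sd


# Hypothetical single case, only to show how one term of the formula is built.
example = gain_effect_size(pre_mean=50.0, post_mean=55.0, pooled_sd=10.0)

# Mean effect sizes reported in the text for the three studies.
black_es = 0.15
white_es = 0.52

difference = black_es - white_es   # blacks minus whites, as in the formula
ratio = white_es / black_es        # how many times larger the white gain is

print(f"example gain effect size = {example:.2f}")
print(f"difference of means      = {difference:+.2f}")
print(f"white/black ratio        = {ratio:.2f}")
```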

In summary, on the basis of this extremely small subsample, it appears
that black gains relative to white gains were small. In terms of the
preceding discussion, these data suggest that the observed gains of
desegregated black children are not attributable to the presence of
white classmates per se. Instead, they appear due to more general
improvements in schools or districts that occur during the
implementation of desegregation.

DISCUSSION 



Interpretation of the Obtained Effect Size 

How does one interpret a mean effect size of +.159? In magnitude, it
approaches the +.20 effect size that Walberg (1983) states is "average"
for various educational interventions. Thus, on this basis the effects
of desegregation are relatively similar to other attempts to improve
educational outcomes. Two points, however, bear reiteration with
respect to this conclusion. First, as argued earlier, desegregation is
not an educational program in the sense that many of the interventions
examined in the Michigan group's quantitative summaries are
(Kulik, Shwalb, and Kulik, 1983; Cohen, and Ebeling, 1980;
Kulik, Kulik, and Cohen, 1979; Kulik, Kulik, and Cohen, 1980).
Computer-based instruction, individualized instruction, open classrooms,
tutorial programs, Bloom's mastery learning, etc., all presumably
improve educational performance as a consequence of identifiable
independent variables that comprise the program. The same cannot be
said for school desegregation. At this point in time, we have not yet
identified an underlying social psychological process which, as a result
of a change in the ratio of black and white students in a classroom or
school, will augment minority scholastic achievement. Second, as
implied by our analyses pointing to moderator variables and as suggested
by our analyses of white student outcomes, when benefit to black
students is found, it is not attributable to desegregation per se, but
instead to other school or district factors that accompany its
implementation.

Factors Affecting Academic Outcomes in Desegregated Settings 

As stated above, there is little good theoretical understanding of how
desegregated schooling might improve the academic performance of
minority children. Much past theorizing has not withstood the test of
data. The next section briefly discusses an array of factors, some of
which were thought in the past to be relevant and some of which continue
to appear important.

Anxiety and threat. The fact that high anxiety impairs performance on
complex or difficult tasks fits with common sense and is one of the
better established findings of psychology. In his review of variables
that affect black performance on cognitive tasks, Katz (1968) summarized
substantial evidence showing impairment when performing under the
scrutiny of higher status whites. The administration of standardized
achievement tests to black students by a white teacher in a white
dominated setting, such as a desegregated classroom, structurally
parallels the situations studied and cited by Katz as impairing black
performance. The fact that standardized achievement tests are
administered with time limits acts to further raise anxiety. Some
evidence suggests that one-way busing of blacks to white receiving
schools will increase their anxiety in general, at least during the
initial phases of desegregation (e.g., Gerard & Miller, 1975). Mussen
(1953) found that black children perceive more hostility or threat in
their environment than do whites. Baughman (1971) interprets the
heightened level of worry and anxiety that black children attribute to
their characters when asked to make up stories as confirming Mussen's
results.

Taken together, such data imply that measured black performance is
likely to be an underestimate of true mastery; they imply that the
obtained effect sizes for black academic achievement do not reflect true
level of achievement. But if adult black intellectual activity is
performed in a white world, aren't such depressed scores in fact
legitimate scores? Perhaps, but in work settings, performance is rarely
under the constant scrutiny of a white supervisor.

Self-concepts and aspirations. In the social science statement appended
to Brown, scholars argued that segregated schooling lowered the
self-concept of the minority child and that this in turn produced a
sense of defeatism, self-doubt, and lack of aspiration that interfered
with effective learning. Although the argument appears credible, it has
not withstood empirical analysis. Not only has the interpretation of
Clark's (1937) original doll preference data on which the argument was
based been questioned (Brand, Ruiz, & Padilla, 1974; Banks, 1976), but
recent reviews of self-esteem research that employs direct self-report
measures consistently show either higher levels of self-esteem among
black children than among white children or no consistent effects (Epps,
1979; Porter & Washington, 1979; St. John, 1975; Stephan, 1978; Wylie,
1979). Furthermore, if school desegregation does affect the self-esteem
of black children, its effects, at least initially, are more likely
adverse than positive (Porter & Washington, 1979).

Measures of aspirations present a similar picture. Black children in
segregated schools typically report higher aspirations than do white
students (Epps, 1975; Proshansky & Newton, 1968; Weinberg, 1975). And
black adults seem to value education more strongly than do whites
(Wilson, 1970). The effect of desegregated schooling on the motivation
of black students remains unclear, some studies showing higher black
aspirations in desegregated schools (Curtis, 1968; DeBord, Griffen, &
Clark, 1977; Fisher, 1971; Knapp & Hammer, 1971; Reniston, 1973), others
showing an opposite effect (St. John, 1966; White & Knight, 1973;
Wilson, 1959), and still others showing little difference between black
children who attend segregated or desegregated schools (Curtis, 1968;
Falk, 1978; Hall & Wiant, 1973). Two points must be made with respect
to this issue. First, most experts today would agree that level of
aspiration per se is not as meaningful or important an indicator of a
healthy personality as is a level of aspiration that is in line with
one's level of performance and one's obtained outcomes. Second, the
nature or design of these studies does not allow causal interpretation
of whatever differences are found.

Finally, although the theorizing of social scientists at the time of
Brown allowed for circular feedback loops (or bi-directional or
reciprocal causations) among self-esteem, motivation and aspiration,
intergroup acceptance, and academic performance, their arguments clearly
emphasized a causal pattern in which personality variables (self-concept
and achievement motivation) caused subsequent changes in academic
performance. If there is any preponderant direction of causal effect,
researchers today would emphasize the impact of school outcomes
(academic performance and achievement) in forming personality or
creating changes in it, rather than a causal pattern in which changes in
personality cause subsequent shifts in performance (Gottfredson, 1980;
Miller, 1982; Rubin, Maruyama, & Kingsly, 1979; Scheirer & Kraut, 1979).

Peer comparison. Students know who is smart and who is not (Lippitt &
Gold, 1959; Hoffman & Cohen, 1972). Differences in opportunity to
perform, when coupled with a narrow range of valued abilities, act to
create widely shared perceptions of competence (Simpson, 1981;
Rosenholtz & Rosenholtz, 1981). When black children attend desegregated
rather than segregated schools, social comparisons between their own
academic performance and that of white students will reveal disparities
that might be expected to lower performance. If such effects occur,
they should be greater at higher grade levels in that, on the average,
the academic disparities between black and white students increase as
they progress through school.

On the other hand, other data suggest that black children primarily
compare themselves to other black children (Baughman, 1971). To the
extent that the desegregation plan provides enough black children in
each class to form the basis for a within-race comparison group, the
debilitating effects of comparison with white children should be
lessened. Moreover, children, like the rest of us, are self-protective
and adaptive. They find ways to ignore self-disparaging comparisons
and, as evidence on black children's self-esteem and aspirations shows,
if anything, these children show high levels of self-regard and
expectation in their self-reports. Whether or not these high levels
are "defensively high," as suggested by Entwisle & Hayduk (1982) and
Miller (1982), and reflect a negative consequence of peer comparison
remains unclear.

Expectations. As indicated above, expectations often create
self-fulfilling cycles. Expectations to perform poorly cause behavior
that subsequently confirms the expectation. But expectations are
intimately linked to actual behavior. Rehearsal of academic information
and content improves performance on subsequent testing of the mastery of
this information. It is the better student who volunteers the answer
when the teacher calls for a response, who leads the discussion in peer
tutoring or small work group exercises, and to whom the teacher routinely
gives more opportunities to respond (Good, 1970). Thus, it is the
better student who gets the benefit of overt rehearsal at the expense of
less capable peers, thereby further improving the performance of the
better student. The social dominance of whites when in interaction with
blacks is well documented. Even when the resources and knowledge
brought to the problem by black and white children are equivalent, the
white child will initiate verbal comments more often than the black and
will dominate the interaction, with the black child taking a more
subordinate role (Cohen, 1982). Apparently, generalized status
differences are implicit in the distinction between races. Even when
black students are primed with correct information that makes them a
superior source of knowledge compared with the white children, the
generalized status difference between blacks and whites nevertheless
results in continued verbal dominance by the white children (Cohen &
Roper, 1982; Tammivaara, 1982).

Peer relations. Some social scientists believed that the peer
environment of the desegregated school would be critical in producing
academic gains (Coleman et al., 1965; Crain & Weissman, 1972; Pettigrew,
1969). This belief rested on the assumptions that (a) the student body
of a desegregated receiving school is more likely than that of a
segregated school to be of middle class family background; (b) middle
class students are more strongly oriented toward achievement and thereby
create a normative structure that emphasizes it; and (c) provided that
the number of white students in the receiving school exceeds the number
of incoming minority students, the latter group will adapt to the
prevailing norm structure of the middle class whites. This argument,
spelled out in detail by Katz (1964), rests on the additional assumption
that minority children will be accepted or befriended by white children.

The latter assumption is, at best, less true than one might wish.
Resegregation is common in desegregated classrooms (e.g., Rogers &
Miller, 1980; Rogers & Miller, 1981; Schofield, 1980), and when white
children accept minority students, it is a consequence of the minority
students' good academic performance rather than a cause of it (Maruyama
& Miller, 1979; Maruyama & Miller, 1983). Thus, it is not the peer
system that provides a critical normative influence. Instead, as
discussed in more detail below, it is provided by the teachers and
administrators.

School effects. Recent research, Jencks et al. (1972) notwithstanding,
shows that schools can exert powerful educational effects on students
(Heyns, 1978) and differ in the extent to which they educate them
(Edmonds, 1976). These effects are system or organization effects,
produced in concert by principals, teachers, students, neighborhood,
and parents, all having reciprocal influence on one another. This is
not to argue that one cannot find, for instance, within-school
differences among teachers both in their background and their approach
to education, or differences among students. It startles no one when a
low social class background is found to be related to a student's
academic performance (Hauser, 1978). Nor does it elicit much more
surprise to learn that the quality of teachers' education affects the
academic outcomes of their pupils (Heim, 1970; Summers & Wolfe, 1977).
More interesting, however, are the substantial differences in academic
outcomes found among schools whose students are basically similar in
social class background and/or race. Although some authors have argued
that such school effects are small (e.g., Sewell, Haller, & Portes,
1969), the studies on which such conclusions are based all use high
school samples. By high school age, self-fulfilling characteristics of
background, expectations, and scholastic outcomes have homogenized
schools, not unexpectedly leaving them similar in their educational
impact and, consequently, leaving the false impression that the type of
school attended cannot make a difference. At earlier ages, however, the
homogenization process is not completed. Interestingly, studies of
elementary schools do show striking differences among schools.

Two recent studies dramatically illustrate the powerful differences
among schools in their effects on students (Brookover, Beady, Flood,
Schweitzer, & Wisenbaker, 1979; Entwisle & Hayduk, 1982). Both are very
substantial in terms of their breadth and the array of measures they
employ. The Brookover et al. study is based on data from over 11,000
students in the fourth and fifth grades in over 90 schools drawn at
random from the entire State of Michigan. Among those, 30 are majority
black schools. This sample exceeds the totals of students and schools in
the entire array of the nineteen NIE sample desegregation studies by a
margin of about 3 to 1. Entwisle and Hayduk (1982) studied
approximately 1,500 children over a three-year period from first to
third grade. Approximately one-third each attended a white
middle class school, an integrated lower class school, and a black lower
class school. Although much smaller in terms of the number of schools
studied, this study measured an even broader array of variables than the
Brookover et al. study and took multiple (longitudinal) measurements of
each variable on each child over the three-year course of the study,
thereby enabling study of the temporal changes in the measured
variables. It is only with temporal spacing of repeated measures on the
same child that one can begin to establish the causal connection between
variables. Thus, the two studies differ substantially in the
characteristics of their research designs. Nevertheless, as will be
indicated below, their results converge in identifying key aspects of
the process of education, as well as showing that schools can produce
very different outcomes for children.

Teachers. Earlier work demonstrated that teachers exert powerful
effects on minority student outcomes (Johnson, Gerard, & Miller, 1975;
Fraser, 1981). When desegregated minority children are imbedded in the
classes of prejudiced teachers, their academic performance worsens,
whereas in the classes of unprejudiced teachers, it improves (Johnson,
Gerard, & Miller, 1975). Furthermore, these effects can be traced to
clear differences in the way in which these two types of teachers
conduct their classes and interact with minority students (Frazer,
1981). This conclusion is supported by Brookover et al. and by Entwisle
and Hayduk. In some lower class black schools the teachers (and the
principal) have given up on the students. They do not view their
students as capable of learning, attributing their poor academic
outcomes to their backgrounds and not demanding good and consistent work
from them. It is important to emphasize here that it is not merely
teachers' expectations that produce the effects, but instead, it is
their behavior. In lower class black schools that produce poor academic
outcomes, students are not expected to perform up to grade level, and
demands requiring them to do so are not placed on them. When teachers
judge their students to be incompetent, they do not attempt to cover as
much academic material (Bees, 1970).

Teachers in most lower class schools also fail to voice concrete
achievement goals. Instead, these children are often reinforced for
incorrect performance, hearing the teacher say, for instance, "good try"
when the answer is very clearly wrong, or not receiving immediate
re-instruction when their response is incorrect (Brophy & Good, 1970).
Academic norms of high academic achievement are recognized in
high-achieving lower class black schools, whereas such norms and a
commitment to academic mastery are missing in the low-achieving schools.
In the high-achieving schools, teachers spend most of the day instructing
their students, reinforcing them discriminantly rather than
indiscriminantly. In these schools, teachers do not highly differentiate
among students and, in the process, write off a large segment of them as
unteachable.

Students. Although many factors may contribute to the greater sense of
control over their outcomes in life seen in middle class as opposed to
lower class children (Coleman et al., 1966), the schools they attend seem
to contribute to this observed difference. The students in
low-achieving schools show a legitimate sense of futility, and with
reason. It is difficult for them to know what to expect, and the
messages they get confuse and demoralize them. The teacher says, "Good,
you're trying hard" or "OK," but they receive C's and D's on their
report card. Consequently, their expectations are not responsibly
modified by their obtained grades. Instead of a sense of mastery
and control of their academic outcomes, these students feel the system
is whimsical and "stacked against them." In contrast, children in
high-achieving middle class schools increasingly come to forecast their
school outcomes accurately. Their expectations more closely correspond
to the grades they receive, with most students predicting their marks
correctly (Entwisle & Hayduk, 1982). Brookover et al. (1982) argue that
a sense of control over school outcomes is one of the essential
ingredients for high student achievement.


Implications of Academic Achievement Results in the Context of
Educational Goals

What does one make of the moderate positive effect of desegregation on
the academic achievement of black children? Although not a strong
clarion for desegregation in its own right, it certainly is not a
deterrent to the continuation of desegregation as a national policy.
More important, however, is the fact that other valuable educational
goals cannot be met without desegregated schooling. Although cognitive
development and academic mastery are obviously appropriate educational
goals, they are not the only ones. Despite some recent signs of
increased interest in "fundamental" education, all school curricula to
some degree attend to dimensions other than verbal and mathematical
skills. Indeed, many components of the standard educational curriculum
attend to dimensions that have little or no direct relevance to
cognitive mastery (e.g., physical education; music, art, and aesthetic
development; mechanical, shop, and home skills; industrial, business,
and other vocational training; etc.).

In some sense all agree that schools must prepare children to function
effectively in their adult lives. Thus, some view with despair the
tracking of students within performance levels and in qualitatively
different academic programs because it functions to prepare students for
occupational and social roles that reflect their socioeconomic origins
(Bowles & Gintis, 1976); and students within the different tracks do
display attitudes and patterns of interpersonal behavior that are
complementary to these future roles (Oakes, 1982).

Similarly, few would argue against the view that interpersonal skills
are relevant to accomplishment and success in adulthood. In a
multi-ethnic society, constructive modes of interethnic interaction, as
well as interethnic acceptance and trust, are valuable attributes. It
is both appropriate and feasible for schools to develop children's
strength and facility in these directions. But schools cannot do so if
children lack day-to-day contact with children whose racial-ethnic
identities differ from their own. The point here is not that contact
per se can be counted on to produce interethnic acceptance. Recent
studies show clearly that racial-ethnic boundaries function to organize
patterns of social interaction in desegregated school settings
(Singleton & Asher, 1979). Furthermore, racial-ethnic encapsulation is



120 



more prevalent among girls than boys (Rogers & Miller, 1981; Schofleld & 
Francist 1982), and hostility is manifested more overtly on the 
playground than in classrooms (Rogers & Miller,* I98l) • The list of 
boundary conditions under which contact is likely to increase 
interethnic acceptance has grown increasingly longer (Cook> 1983; 
Stephan & Stephan, 1983), On the other hand, and perhaps in response to 
the growing realisation that they are needed, social scientists have 
begun to develop educational technologies that successfully promote 
increased interethnic acceptance (Aronson et al, 1978; Cohen' fit Roper, 
1972; Cook. 1982; DeVries, Edwards, & Slavin. 1978; Johnson, 1975; 
Rogers, Hennigan, & Miller. 1981; Sharan & Sharan, 1976; Slavin. 1978; 
Serow & Solomon. 1979), Though these procedures differ 'in their 
details, the common thread among them is their use of structured 
cooperative interaction in small groups, whether in conjunction with the 
curriculum or on the playground. Meta-analyses of their use not only 
show consistent andT substantial benefit ta interethnic Acceptance, but 
improved academic mastery when coordinated with academic curriculum 
materi^TsMJohnson, Haruyama. Johnson. Nelson. & Skon. 1981; Johnson. 
Johnson, & Maruyama. 1983), 

In summary, it is appropriate for schools to be concerned with
children's development of effective and constructive interpersonal
skills. The capacity for interethnic acceptance, respect, and trust is
an important aspect of intrapersonal development and requires the
existence of desegregated schools. Among the various goals that might
be achieved by desegregated schooling, increased interethnic acceptance
most directly addresses the central concern of Brown, namely, the
stigmatization of blacks. Thus, we would argue that even if on the
average the effect of desegregated schooling on academic achievement was
shown to be zero, desegregated schooling is required if the issue of
interracial acceptance is to be addressed.

Conclusion 

Taken together, the desegregation studies that meet the NIE minimal
criteria show some moderate academic benefit to black children when they
attend desegregated schools. Although one reviewer finds a larger
margin of benefit among studies with stronger designs (Crain & Mahard,
1978), most reviewers find that the magnitude of effect is smaller in
studies with better research designs (e.g., Krol, 1978; St. John, 1975).
Our calculation of the magnitude of these effects translates into the
rather trivial increase of about twenty points on the typical SAT
college entrance test, which has a mean of 500 and a standard deviation
of 100. Most studies of desegregation assess the effects of only a year
of desegregated schooling. The likelihood, however, that twelve years
of desegregated schooling will translate into an average gain of over
200 points (two standard deviations) on an SAT-type of test seems low.
Our own longitudinal data from Riverside, California certainly argue
against such a view (Gerard & Miller, 1975). On the other hand, the
high likelihood that the same level of performance is evaluated more
favorably by the external world if a black student attends a
desegregated, as opposed to a segregated, school must be added to this
picture. Given equal grade point averages or achievement test scores,
the black student from a desegregated school is likely to be viewed as
more capable and promising than his or her peer from a segregated
school.
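The arithmetic behind the SAT comparison can be sketched as follows. This is an illustration only: the twenty-point yearly gain and the standard deviation of 100 restate figures from the paragraph above, while the simple summing of yearly gains over twelve years is a deliberately naive assumption, and it is the very assumption the paragraph regards as unlikely to hold.

```python
# Illustrative arithmetic only; the constants restate figures quoted in the text.
SAT_SD = 100  # standard deviation of the SAT-type scale


def points_to_sd_units(points, sd=SAT_SD):
    """Express a gain in scale points as a fraction of a standard deviation."""
    return points / sd


def naive_cumulative_points(points_per_year, years):
    """Assumption: yearly gains simply add up (the text doubts that they do)."""
    return points_per_year * years


one_year_gain = 20  # about twenty points after one year of desegregation
one_year_sd = points_to_sd_units(one_year_gain)
twelve_year_gain = naive_cumulative_points(one_year_gain, 12)
twelve_year_sd = points_to_sd_units(twelve_year_gain)

print(f"one year: {one_year_gain} points = {one_year_sd:.2f} SD")
print(f"twelve years, naive sum: {twelve_year_gain} points = {twelve_year_sd:.2f} SD")
```

Under the naive additive assumption the twelve-year figure exceeds the 200 points (two standard deviations) mentioned in the text, which is why a simple extrapolation from one-year studies seems implausible.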

Our analyses of these and other data argue that the ratio of black and
white students per se is probably not a direct causal factor in
producing the small positive effect that is found. The fact that the
magnitude of benefit is greater in studies conducted in the sixties than
in those of the seventies supports this interpretation. The higher
expectations and greater resources available in the earlier era should
have generated increased morale and greater disruption of the status
quo, thereby breaking the system effects that ordinarily depress the
academic mastery of black children. Thus, we argue that whatever the
academic effects found, they are due to teachers and schools and only
attributable to changes in the percentages of black and white students
to the extent that such changes concomitantly change teachers and
schools.

Given the school effects that have been described in earlier sections,
one could argue that such results essentially argue against the
desegregation of schools. Since they imply that lower class minority
schools can be effective, education administrators should simply make
the changes necessary to see that all such schools function effectively.
Such a suggestion is not without merit, but is not easy to implement.
When new teachers are brought into such schools to replace old ones, the
normative structure exerts its influence on them, making them similar in
outlook and practice to those they replaced. Such systems of norms can
continue to show their effects, even when all the persons in the system
have one by one been replaced (Jacobs & Campbell, 1961). As new persons
come into the system they too adopt the old norms, and in turn, transmit
them to still newer replacements.

For these reasons, a change in the black child's school environment is
more easily achieved by moving him or her to a more middle class school
than by attempting to change the school currently being attended.
Middle class schools, being more likely to be high-achieving schools,
are less likely to have these debilitating systems of norms. Such a
change can also give the minority student a sense of a fresh start.

In conclusion, the fact that school desegregation does not depress the
academic performance of black children, but instead is moderately
positive in its effect (and, as revealed in other reviews, does not
adversely affect the academic performance of white children), means that
if there are other compelling reasons to desegregate schools,
consideration of academic achievement provides no deterrence. Because
racially mixed schools are necessary if effective programs for
increasing intergroup acceptance are to be applied, school desegregation
should be encouraged.

Footnotes



1. Technically termed regression, this effect is due to the fact that
the measuring instruments (tests) do not tell us each person's true
score; there is a component of error in each score.

2. In determining whether or not the amount of variability across the
studies exceeds that which would be expected on the basis of
sampling error, it is necessary to weight the effect size estimates
by sample size. Because smaller sample sizes are associated with
increased imprecision of effect size estimates, it is important to
assign such cases less weight so as not to overestimate the extent
of variability that occurs over and above sampling error (i.e., to
avoid overstating the case for the operation of moderator
variables). It should be noted, however, that although taking a
nonweighting approach normally will increase the likelihood of
falsely concluding that moderators are present, this same
procedure, which is the one that we do use for estimating the
correlation between moderator variables and effect size, is
conservative in this latter regard. The reason for this is that
cases involving increased attenuation (via the imprecision of small
samples) are given equal weight in determining the amount of
correlation.
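
The weighting described in footnote 2 can be sketched as follows. This is a minimal illustration, not the authors' procedure: the effect sizes and sample sizes are hypothetical, and the expected sampling variance uses a common large-sample approximation for a standardized mean difference between two equal-sized groups, which is an assumption introduced here.

```python
def weighted_mean(effects, sizes):
    """Mean effect size with each study weighted by its sample size."""
    return sum(d * n for d, n in zip(effects, sizes)) / sum(sizes)


def weighted_variance(effects, sizes):
    """Sample-size-weighted variance of effect sizes around the weighted mean."""
    mean = weighted_mean(effects, sizes)
    return sum(n * (d - mean) ** 2 for d, n in zip(effects, sizes)) / sum(sizes)


def expected_sampling_variance(effects, sizes):
    """Variance expected from sampling error alone.

    Assumption: each study compares two equal-sized groups, so the variance
    of d is roughly (4 / n) * (1 + d**2 / 8), evaluated at the average n.
    """
    mean = weighted_mean(effects, sizes)
    average_n = sum(sizes) / len(sizes)
    return (4.0 / average_n) * (1.0 + mean ** 2 / 8.0)


# Hypothetical studies: two large ones and two small, imprecise ones.
effects = [0.10, 0.30, 0.60, -0.20]
sizes = [400, 200, 20, 30]

unweighted = sum(effects) / len(effects)
weighted = weighted_mean(effects, sizes)
observed = weighted_variance(effects, sizes)
expected = expected_sampling_variance(effects, sizes)
# Variability left over after sampling error is what moderators could explain.
residual = max(observed - expected, 0.0)

print(f"unweighted mean d = {unweighted:.3f}, weighted mean d = {weighted:.3f}")
print(f"observed variance = {observed:.4f}, expected from sampling = {expected:.4f}")
print(f"residual variance = {residual:.4f}")
```

In the weighted calculations the small studies, whose estimates are least precise, contribute little, which is the safeguard against overstating the case for moderator variables that the footnote describes.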

1) Type of Study

   a) non empirical
   b) summary report

2) Location

   a) outside USA
   b) geographically non specific

3) Comparisons

   a) not a study of achievement of desegregated blacks
   b) multi-ethnic combined
   c) comparisons across ethnics only
   d) heterogeneous proportions minority in desegregated
      condition
   e) no control data
   f) no pre-desegregation data
   g) control measures not contemporaneous
   h) majority black in a segregated condition (unless the
      reviewer provides specific justification)
   i) varied exposure to desegregation (unless the
      reviewer provides a specific justification
      demonstrating that the variation in exposure
      time is not meaningful)

4) Study Desegregation

   a) cross-sectional survey
   b) sampling procedure unknown
   c) separate non-comparable samples at each
      observation

5) Measures

   a) unreliable and/or unstandardized instruments
   b) test content and/or instrument unknown
   c) dates of administration unknown
   d) different tests used in pretests and posttests
   e) test of IQ or verbal ability

6) Data Analysis

   a) no pretest means
   b) no posttest means, unless the author reported
      pretest scores and gains
   c) no data presented
   d) The following will be rejected dependent upon the
      amount of information available for the reviewer to
      estimate values

      1. no pretest standard deviations
      2. no posttest standard deviations
      3. no significance tests
      4. N's not discernible

It was decided that "excessive attrition" and "groups that are initially
non-comparable" would not be used as criteria for rejection. In each
case it was argued that the point at which the problem became an issue
was extremely vague. It was felt that the project is better served by
including studies exhibiting attrition and comparability problems and
allowing individual reviewers to articulate these limitations. Using
these criteria, 19 studies were deemed acceptable for
inclusion in the project. These are:

Anderson, Lewis V. The effect of desegregation on the achievement and
personality of Negro children. Unpublished doctoral dissertation,
George Peabody College for Teachers, 1966. (University Microfilm
66-11,237)

Baker, Jerome. A study of integration in racially imbalanced urban
public schools. Syracuse, New York: Syracuse University Youth
Development Center, Final Report, May 1977.

Bowman, Orrin H. Scholastic development of disadvantaged Negro pupils:
A study of pupils in selected segregated and desegregated
elementary classrooms. Unpublished doctoral dissertation,
University of New York at Buffalo, 1973.

Carrigan, Patricia M. School desegregation via compulsory pupil
transfer: Early effects on elementary school children.
Ann Arbor, Michigan: Ann Arbor Public Schools, 1979.

Clark, El Nadel. Analysis of the difference between pre- and post-test
scores (change scores) on measures of self-concept, academic
aptitude, and reading achievement earned by sixth-grade students
attending segregated and desegregated schools. Unpublished
doctoral dissertation, Duke University, 1971.

Evans, Charles L. Short term desegregation effects: The academic
achievement of bused students 1971-1972. Fort Worth, Texas:
Fort Worth Independent School District, 1973. (ERIC No. ED 086
759)

Iwanicki, E.F., & Gable, R.K. A quasi-experimental evaluation of the
effects of a voluntary urban/suburban busing program on student
achievement. Paper presented at the Annual Meeting of the
American Educational Research Association, Toronto, Canada,
March 1978.

Klein, Robert Stanley. A comparative study of the academic achievement
of Negro tenth grade high school students attending segregated
and recently integrated schools in a metropolitan area in the
south. Unpublished doctoral dissertation, University of South
Carolina, 1967.

Laird, M.A., & Weeks, G. The effect of busing on achievement in reading
and arithmetic in three Philadelphia schools. Philadelphia,
Pennsylvania: The School District of Philadelphia, Division of
Research, 1966.







Rentsch, George J. Open-enrollment: An appraisal. Unpublished
doctoral dissertation, State University of New York, Buffalo,
1967.

Savage, L.W. Arithmetic achievement of black students transferring
from a segregated junior high school to an integrated junior
high school. Unpublished masters thesis, Virginia State College,
1971.

Sheehan, Daniel S. "Black achievement in a desegregated school
district," Journal of Social Psychology, 1979, 107, 165-182.

Slone, Irene W. The effects of one school pairing on pupil
achievement, anxieties and attitudes. Unpublished doctoral
dissertation, New York University, 1968.

Syracuse City School District. Study of the effects of integration:
Washington Irving and Host Pupils. Hearing held in Rochester,
New York, September 16-17. U.S. Commission on Civil Rights.

Thompson, E.W., & Smidchens, U. Longitudinal effects of school
racial/ethnic composition upon student achievement. Paper
presented at the Annual Meeting of the American Educational Research
Association, San Francisco, California, April 1979.

Van Every, D.W. Effects of desegregation on public school groups of
sixth graders in terms of achievement levels and attitudes
toward school. Doctoral dissertation, Wayne State University,
1969. Dissertation Abstracts International, 1969. (University
Microfilms No. 70-19074)

Walberg, Herbert J. An evaluation of an urban-suburban school busing
program: Student achievement and perception of class learning
environments. Paper presented at the annual meeting of the
American Educational Research Association, New York, New York,
February 1971.

Zdep, Stanley M. "Educating disadvantaged urban children in suburban
schools: An evaluation," Journal of Applied Social Psychology,
1971. (ERIC No. ED 053 186 TM 00716)







Blacks and "Brown": The Effects of
School Desegregation on Black Students*

Walter G. Stephan
New Mexico State University



The Effects of Segregation and Desegregation 

It is important to put the question of the effects of desegregation on
Black achievement in historical context. To do this I would like to
quote from social scientists and other expert witnesses who testified in
the Brown (1954) trial. It is clear from their testimony that the
social scientists believed that segregation had a negative impact on
Black achievement in at least three ways.

First, the fact that segregated Black schools were inferior to White
schools in terms of the quality of the facilities and per pupil
expenditures was thought to lead to low levels of achievement. Prior to
Brown it was not uncommon for Southern states to allocate from 2 to 5
times as much money per pupil for White students as was allocated for
Blacks (Ashmore, 1954; Thompson, 1975). Also, Black schools in the
South had teachers who were less well trained and who were paid about
half as much as teachers in White schools (Ashmore, 1954). Conditions
in Black schools were often appalling. Consider the findings of Matthew
Whitehead, who testified about the schools in Clarendon County, South
Carolina, during the Briggs vs. Elliot (1951) case.

"The total value of the buildings, grounds, and
furnishings of the two white schools that accommodated 276
children was four times as high as the total for the three Negro
schools that accommodated a total of 808 students. The white
schools were constructed of brick and stucco; there was one
teacher for each 28 children; at the colored schools, there
was one teacher for each 47 children. At the white high
school, there was only one class with an enrollment as high as
24; at the Scott's Branch high school for Negroes, classes
ranged from 33 to 47. Besides the courses offered at both
schools, the curriculum at the white high school included
biology, typing, and bookkeeping; at the black high school,
only agriculture and home economics were offered. There was
no running water at one of the two outlying colored grade schools
and no electricity at the other one. There were indoor flush
toilets at both white schools but no flush toilets, indoors or
outdoors, at any of the Negro schools, only outhouses, and not
nearly enough of them." (Kluger, 1976, p. 332)




*The author wishes to thank Deanna Nielson for her assistance in
preparing this article.






Second, it was thought that the "badge of inferiority" that segregation
represented led Black students, and their teachers, to have low
expectations regarding their capacities to achieve. These low
expectations were believed to lead to low achievement. This argument
can be traced through the testimony of several social scientists. David
Krech said:

"Legal segregation, because it is legal, because it
is obvious to everyone, gives... environmental support
for the belief that Negroes are in some way different from
and inferior to white people." (Kluger, 1976, p. 362)

In another trial Horace English testified that:

"If we din it into a person that he is incapable
of learning, then he is less likely to be able to learn...
There is a tendency for us to live up to, or perhaps I
should say down to, social expectations and to learn what
people say we can learn, and legal segregation definitely
depresses the Negro's expectancy and is therefore prejudicial to
his learning." (Kluger, 1976, p. 415)

Third, in addition to reducing expectancies, segregation was also
thought to reduce the motivation to learn among Black students.
Brewster Smith testified that:

"Segregation is, in itself, under the social
circumstances in which it occurs, a social and official
insult and has widely ramifying consequences on the
individual's motivation to learn." (Kluger, 1976, p. 491)

And Louisa Holt argued that:

"The fact that segregation is enforced...
gives legal and official sanction to a policy which is
inevitably interpreted both by white people and by
Negroes as denoting the inferiority of the Negro group...
A sense of inferiority must always affect one's motivation
for learning since it affects the feeling one has for one's
self as a person." (Kluger, 1976, p. 421)

In the original Brown (1951) decision this line of reasoning was
sufficient to convince Judge Huxman that:

"Segregation of white and colored children in public
schools has a detrimental effect upon the colored children.
The impact is greater when it has the sanction of the law;
for the policy of separating the races is usually interpreted
as denoting the inferiority of the Negro group. A sense of
inferiority affects the motivation of a child to learn.
Segregation with the sanction of law, therefore, has a tendency
to retard the educational and mental development of Negro children
and to deprive them of some of the benefits they would receive
in a racially integrated school system." (Kluger, 1976, p. 424)

To summarize, it was because segregation was associated with inferior
schools and led to low levels of expectancy and motivation in Black
children that it was believed to cause low levels of achievement. At
the time little or no data existed on the relative achievement levels of
Blacks and Whites in segregated schools. Thus, the argument rested on
reason, not fact.

Because the Brown trials were concerned with the negative effects of
segregation, minimal consideration was given to the anticipated effects
of desegregation. In fact, desegregation as a remedy for segregation
was rarely mentioned (Kluger, 1976). The social scientists' arguments
concerning the effects of segregation implied that removing the "badge
of inferiority" represented by segregation would increase the academic
expectancies and motivation of Blacks and that these increases, along
with improved facilities and instruction, would lead to higher
achievement.

Subsequent theorizing about the effects of segregation and desegregation
on Black achievement has elaborated on these basic notions. For
instance, the U.S. Commission on Civil Rights' study of Racial Isolation
in the Public Schools suggested that:

"Negro children suffer serious harm when their education
takes place in public schools which are racially segregated,
whatever the source of such segregation may be. Negro children
who attend predominantly Negro schools do not achieve as well as
other children, Negro and White. Their aspirations are more
restricted than those of other children and they do not have much
confidence that they can influence their own futures." (1967)


Jencks and his colleagues (Jencks, Smith, Acland, Bane, Cohen, Gintis,
Heyns, and Michelson, 1972, pp. 97-98) offered four reasons why
desegregation should improve Black achievement. First, they cited the
anticipated positive effects of improvements in school and teacher
quality. Second, they cited the knowledge that may be acquired from
White peers who have been socialized into middle class White norms, the
lateral transmission of values hypothesis (for evidence that this does
not occur see Miller, 1981). Third, Jencks et al. suggested that
teachers in desegregated schools may expect more from Blacks and this
may lead Blacks to learn more. Fourth, desegregation may lead Blacks to
expect that they have a better chance of making it in society, which may
motivate them to work harder and learn more (for a synthesis of many of
these arguments see Linsenmeier and Wortman, 1978).

Achievement Tests

All of the studies to be considered in this analysis of the effects of
desegregation on Black achievement employed standardized achievement
tests. Any understanding of the results of these studies requires that
some consideration be given to the nature of these tests. Achievement
tests were developed to measure what students have learned. They
consist of items that sample the general body of knowledge that schools
are expected to teach. The items that are selected are those that
discriminate best between students who have learned a great deal
and those who have not. Items which sample knowledge that everyone
learns are not included. This restricts the type of knowledge sampled
to that which is not always learned or taught.

The tests usually take one to three hours to complete. During this
period students at the junior high school level attempt to answer
approximately 85 multiple choice questions per hour. The content areas
covered most thoroughly (and the only ones reported in most
desegregation studies) are math and verbal skills. Some tests deal with
science and social studies, but use less extensive coverage for these
topics. Thus, these tests examine only a very restricted domain of
achievement. This domain, verbal and math skills, is clearly important,
but so too are other domains of achievement that are not measured.
Among these other domains are knowledge of our political, economic, and
legal systems, and knowledge of the history of our society and other
countries.

Scores on these tests correlate reasonably well from year to year and
they correlate reasonably well with tests designed to measure aptitude
and intelligence (Jencks et al., 1972, p. 60; Wallach, 1976). However,
neither achievement tests nor those designed specifically for the
purpose are especially good at predicting college grades or later
success in life (Jencks et al., 1972, p. 57).

The test that has been most extensively scrutinized in this regard is
the Scholastic Aptitude Test (SAT) developed by the Educational Testing
Service. More than 2,000 studies have examined the ability of this test
to predict future academic performance. The results indicate that the
SAT correlates about .30 to .40 with first year college grades (Lord and
Campos, cited in Linn, 1982). SAT scores do not correlate as well with
overall college grades (Humphreys, 1968) nor do they predict whether or
not students will finish college (Astin, 1970). Also, there is little
relationship between SAT scores (or similar measures such as the GRE)
and later success after college (Marston, 1971; McClelland, 1971). In
sum, the SAT and most standardized achievement tests have high content
and construct validity, but only low to moderate predictive validity.

We must be extremely cautious in interpreting the meaning of achievement
scores. They reflect the amount of standard curriculum material in the
domain of math and verbal skills that students have learned. Thus,
achievement scores may serve as an indicator of the quality of the math
and verbal skills programs at the schools the students are attending,
although the same material may be acquired in the home, from peers, or
from the mass media. To the extent that desegregation has an effect on
achievement scores, it may be caused by changes in the quality and
amount of instruction in math and verbal skills, changes in the quality
of the student body, or changes in the students' motivation to learn.
The changes that do occur probably should not be interpreted as an
indication that the students will subsequently be more or less
successful in institutions of higher education or in economic terms.

I do not mean to imply that test scores are not important, but I believe
they are often important for the wrong reasons. Scores on achievement
tests are used as criteria to determine what tracks students will be
assigned to and whether students will be admitted to college. They are
also important because students and teachers perceive the scores as an
indication of ability and individual worth. In this way, these tests
may place inappropriate limits on the aspirations and self evaluations
of low scoring students and they may lead teachers to have low
expectations for low scoring students (for evidence that teachers have
low expectations see Mercer, Iadicola, and Moore, 1980).



Because these tests measure what students have learned, anything that
affects how much material they are taught or their capacity to
assimilate what is presented will affect achievement test scores.
Curriculum changes, differences in styles of presentation and testing,
and disruptions that influence the capacity of teachers to teach or
students' ability or desire to learn are likely to have a negative
impact on what students learn. Because many of the studies reported in
the literature cover only the initial phases of school desegregation,
they are very likely to be affected by these factors. In particular,
the learning environment is apt to differ from the students' previous
experiences, especially for minority students. Some of these
differences may be beneficial in the long run, such as more demanding
teachers, more competitive classmates, and greater diversity in the
student body, but these factors may initially have negative effects on
achievement. Other factors such as tension and conflict between groups,
negative comparisons with better prepared students who are often higher
in social class, and dealing with teachers who have little experience
teaching minority group students probably have a negative impact and
continue to do so.

Although achievement tests are designed to measure what students have
learned, scores on these tests are also affected by other factors. Most
important among these other factors is the situation in which the tests
are administered. In particular, high anxiety levels have a negative
effect on performance, except for the very best students. It is
possible that Black students taking these tests in desegregated schools
experience more anxiety than Blacks in segregated schools. This is
likely to be the case to the extent that achievement is emphasized in
desegregated schools and the Black students feel academically inferior
to or threatened by the White students.

Achievement tests are "speeded," which means that students have a time
limit that is too short for many of them to finish all the items. This
too may create anxiety; it also means that a premium is placed on
motivation and attentiveness. Students who are not motivated to do well
or who do not try hard will not score well on these tests. Lapses of
attention that amount to 5 minutes during the testing hour will mean
failing to answer about 7 questions (at the junior high level). This
could affect the outcomes by more than 50 points (on tests that have a
range of 200-800 with an average of 500). The tests are most likely to
yield accurate results when the conditions of testing do not elicit high
levels of anxiety and the students are motivated to do well and are
attentive. While these factors would be expected to influence measures of
achievement both before and after desegregation, it would not be
surprising to find that they had a more negative impact after
desegregation.
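
The estimate of about 7 unanswered questions follows from the pace of roughly 85 questions per hour cited earlier for the junior high level. A minimal sketch of that arithmetic, assuming a steady answering pace throughout the hour (an assumption made here for illustration), is:

```python
QUESTIONS_PER_HOUR = 85  # approximate junior high pace quoted in the text


def questions_missed(lapse_minutes, questions_per_hour=QUESTIONS_PER_HOUR):
    """Questions left unanswered during a lapse, assuming a steady pace."""
    return questions_per_hour * lapse_minutes / 60


missed = questions_missed(5)
print(f"{missed:.1f} questions missed in a 5-minute lapse")
```

Converting missed questions into scale points depends on each test's scoring table, which the text does not give, so the sketch stops at the question count.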

The race of the examiner can also affect test performance. Blacks often
perform better when the examiner is Black rather than White (e.g., Katz,
Roberts, and Robinson, 1965). It is frequently the case that as
students move from segregated to desegregated schools the race of the
examiners changes from Black to White. Regrettably, we have no
information on the degree to which such factors actually have affected
the results of the studies we are reviewing, but they should lead us to
be cautious about interpreting these studies.

The Studies in the N.I.E. Study Set

Anderson

This early study examines an unusual early desegregation plan in which
students in the numerical minority in a given school could transfer to
schools in which their group was in the majority. Thus, students could
transfer from desegregated to segregated schools. The study was done in
Nashville in 1963. It followed students from the 2nd to the 4th grade.
The Metropolitan Achievement Tests were used to measure reading and math
achievement. The sample size was adequate (N = 34 in the desegregated
group), but not large. It is possible that some of the students in the
desegregated group were exposed to one year of desegregation prior to
being pretested in the second grade. It appears from the report that
this problem probably affected less than one-sixth of the students in
this group.


Beker 

Like most early studies, the desegregation that was examined in this
study (1964) consisted of voluntary transfers. The study was done in a
large Northern city. Two grade levels were included (grades 2 and 3).
The sample sizes were very small and may yield unreliable results
(N = 7-25). The study is a Fall-to-Spring comparison of reading and
math abilities done during the first year of desegregation (measured
with the Stanford Achievement Test).

Bowman 

This is one of the longer studies in the set. It runs from 1967 to
1970. A group of students was followed from grades 1 to 3 and another
group from grades 3 to 5. The sample sizes were of moderate size
(around 50 total at each grade level), but adequate. The students
participated in the program voluntarily and it took place in a
medium-sized Northern city (Syracuse). Different tests, the Iowa Tests of
Basic Skills and New York State's Tests, were used to measure
achievement at the pretest and the posttest levels, which makes changes
in test scores somewhat difficult to interpret.

Carrigan

I did not calculate effect sizes for this study because I believe the
control group cannot be used to assess the effects of desegregation.
In this study the control group was attending desegregated schools (50%
Black). Since this control group had already received the "treatment"
of desegregation, they provide a check primarily for maturation effects.
Any changes in this group may be a consequence of ongoing exposure to
desegregation, which means that the differences occurring in this group
are not a proper control for the differences in the "desegregated"
group. Also, the "desegregated" group actually started out in a
somewhat desegregated school (80% Black), so this is not an optimal
group to measure the effects of desegregation.


Clark

This is one of the small number of studies in the set that was done in
the South. It is a study of a majority-to-minority transfer program
that took place in 1969-1970. The sample size is adequate (N = 108 for
desegregated group), but the duration of the study is brief, extending
from Fall to Spring. This is the only study in the set that includes
rural students. It covers only the sixth grade and provides both a test
of reading and math (SCAT).

Evans

This study was done in Fort Worth during the 1971-1972 school year. The
Iowa Tests of Basic Skills were given to 4th and 5th grade students in
the Fall and Spring of that year. The court-ordered desegregation plan
involved clustering elementary students and busing Black students (in
grades 3-5) to achieve a degree of racial balance. The sample sizes
were larger than in most of the other studies in this set (N = 179-393).

Iwanicki and Gable 

I excluded this study because the "predesegregation" group had already
been attending desegregated schools for a full academic year at the time
of the "pretest." Thus, the predesegregation comparison is actually a
cross-sectional comparison between a segregated control group and a
group of students that has been desegregated for one year. This means
that the measure of the effects of desegregation is a measure of the
effects of the second year of desegregation. Since all of the other
studies that I have included measured the first year of desegregation,
including this study with the others may yield an inaccurate picture of
the effects of desegregation. This would be particularly true if
desegregation had a greater impact on achievement during the first year
than during subsequent years.

Klein 

This is a Fall-to-Spring examination of the effects of desegregation
done in a small city (35,000) in the South. The students were in the
tenth grade. The sample size was adequate (N = 38 in the
desegregated group), but not large. The study was done in 1963. The
desegregation plan was a voluntary one involving Black students who
transferred from segregated Black schools to White schools. The tests
used were the Math and English Cooperative Exams.

Laird and Weeks

This is an early study of the effects of desegregation (1964). It was
done in a large Northern city (Philadelphia) over a 1^-year time span. 
Desegregation was brought about by overcrowding in a segregated Black
school. Parents in this school could request to transfer their children
to White schools, so desegregation was voluntary. Students in grades 4-6
were tested on the district's own verbal and math tests. The sample
size at each grade level is modest (22-39), but acceptable.

Rentsch 

This study was done on a voluntary desegregation plan in Rochester, New
York, and covers a 2-year time period. There were adequate sample sizes
(N = 27 to 33) to calculate effects in grades 3-5. The students were
tested on reading and math skills (apparently using a test developed by
the District). The students who attended the desegregated schools had
previously attended schools that were 90% minority. Attrition was
fairly high in this group (56%). Although this study provided analyses
of both matched and unmatched samples of segregated and desegregated
students, I decided against using the analyses of the matched groups
because the sample sizes were small (N = 9-13).

Savage 


This study covered a longer time period than many of the others, 2
years, and it is one of the minority of studies that were conducted in
the South (Richmond, Va.). Also, it is one of the relatively small
number of studies examining senior high school students. The sample
size is adequate (N = 42 in the desegregated group) to calculate
reliable means for math and reading achievement on the Sequential
Educational Progress Test. The study was conducted between 1969 and
1971 and examined a voluntary desegregation plan involving
minority-to-majority transfers.

Sheehan and Marcus 

This study was done in Dallas, Texas, and covers a 1^-year period. It
involves court ordered busing and it was done recently (1976-1978). In
these regards it is more representative of urban desegregation programs
than most of the other studies in the set. The fourth grade students
were measured with the Iowa Test of Basic Skills. The sample size is
very large (nearly 2,000). One drawback is that the degree of
desegregation varied considerably within the desegregated sample (from
5% to 65% Black).
Slone

This is a study of the second year of school desegregation.
Desegregation occurred during the 1963-1964 school year. The first
measure of achievement was gathered in April 1965 and the second in
March 1966. The predesegregation school was multi-ethnic (90% minority,
but only about 70% Black) and thus this study differs from the other
studies of desegregation. Also, the "segregated" control group was
attending a school that was iOI White. Since the predesegregation
levels of achievement cannot be determined, the effects of desegregation
cannot be evaluated.

Smith 


This is a long-term study, covering 3 school years. It was conducted in
Tulsa, Oklahoma. The students were pretested in seventh grade and
posttested in ninth grade. The sample size is larger than in most
studies (N = 274). The Stanford Achievement Tests were used to measure
math and verbal skills. The desegregated students were attending
naturally integrated junior high schools. Unfortunately, no information
was provided on the degree of segregation in Tulsa's elementary schools,
but it is probably reasonable to assume a high level of segregation
given that the study began in 1965.

Syracuse 

This study of fourth grade students measured reading achievement
(Stanford Achievement Test) in the Fall and Spring of the 1965-1966
school year. The number of students in the desegregated group was
small, but adequate (N = 24). The type of desegregation program the
students participated in is not specified in the report.

Thompson and Smidchens

This study of natural desegregation in the elementary schools of Ann
Arbor was eliminated from the analyses because the students had been
attending desegregated schools for 2 years before the predesegregation
measures were obtained. Thus, this study lacks a true predesegregation
measure. In addition, the "segregated" control group was 58% White.

Van Every

This study was done in Flint, Michigan, and involves desegregation
produced by locating a low-cost housing project in a previously all
White neighborhood. The study covers a 2-year period, following
students from the fourth to the sixth grade. The sample size is
somewhat small (desegregated group N = 22). The study was completed
in 1969. The Science Research Associates tests for reading and math
were used.

Walberg

This is a study of the Boston Metro Project in which urban Black
students at all grade levels were voluntarily bused to suburban White
schools. The performance of these Black students on the
Metropolitan Achievement Tests for reading and math was compared to
the performance of their siblings who remained in segregated Black
schools. The study was conducted during 1968-1969. The sample sizes
for the desegregated groups are moderate (N = 61-144); those for the
segregated groups are smaller (N = 14-53), but still reasonably
adequate.

Zdep

This is a study of a voluntary transfer plan in which urban Blacks could
attend suburban schools. The students were very young (grade 2). The
Metropolitan Readiness Test was used to measure reading and math ability
in the Fall and during the Spring of the first year of desegregation.
The study was done in 1968. The sample size was quite small and may not
yield reliable results (N = 12 in the desegregated group). The report
does not indicate where the study was done.

In summary, the desegregation in these studies was typically voluntary
(66% of the cases), the cities it occurred in were generally medium to
large, the region was more often the North than the South, the schools
the students attended were more frequently elementary schools than
secondary schools (mean grade level = 5.5), Blacks were very much in the
minority in most of these schools, and most of the studies were
conducted prior to 1970 (mean = 1968).

Effect Sizes 

The principal measure of interest to be extracted from these studies is
the size of the effects of desegregation on the verbal and math
achievement of Black students. To calculate these effect sizes the
formulas proposed by Glass (1977) were employed. In calculating these
effect sizes I have taken into consideration the duration of the study.

All of the studies included in the study set employ
quasi-experimental designs in which one group of students is tested
before and after desegregation. The results for these students are
compared to those of a group of students who remain in segregated
schools and who are pretested and posttested at the same time as the
desegregated group. The generic formula to obtain effect sizes in
standard deviation units for this design is to calculate the difference
between the desegregated and segregated groups at the pretest and divide
this score by the standard deviation for the segregated group.



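\[
\frac{\bar{X}_{D,\mathrm{pre}} - \bar{X}_{S,\mathrm{pre}}}{SD_{S,\mathrm{pre}}}
= \text{pretest difference} \tag{1}
\]

This score indicates the degree of pretest equality between the two
groups (D = desegregated, S = segregated). A similar score is then
obtained for the posttest scores.

\[
\frac{\bar{X}_{D,\mathrm{post}} - \bar{X}_{S,\mathrm{post}}}{SD_{S,\mathrm{post}}}
= \text{posttest difference} \tag{2}
\]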



To derive an overall effect size the pretest difference (1) is
subtracted from the posttest difference (2). This formula yields an
index of the magnitude of the effects of desegregation in units that can
be compared across studies.
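
The calculation described above can be sketched in Python as follows.
This is only an illustration under stated assumptions: the function
names and the scores are hypothetical and are not taken from any of the
studies reviewed here.

```python
def standardized_difference(mean_deseg, mean_seg, sd_seg):
    """Difference between group means in segregated-group (control) SD units."""
    return (mean_deseg - mean_seg) / sd_seg


def effect_size(pre, post):
    """Posttest difference minus pretest difference.

    `pre` and `post` are (mean_deseg, mean_seg, sd_seg) tuples.
    """
    return standardized_difference(*post) - standardized_difference(*pre)


# Hypothetical scores, for illustration only.
pre = (40.0, 42.0, 10.0)   # desegregated mean, segregated mean, segregated SD
post = (47.0, 46.0, 10.0)

es = effect_size(pre, post)
print(f"overall effect size: {es:.2f}")
```

Subtracting the pretest difference means that a desegregated group that
started out behind the segregated group is credited only with the change
in its relative standing, not with its final standing.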
The use of the standard deviation of the control group (the segregated
group in this case) to calculate effect sizes was proposed by Glass
(1977). It would be possible to use, in place of this standard deviation,
a pooled standard deviation comprised of the average of the standard
deviations of the experimental and control groups on the assumption that
this would yield a more stable estimate of the standard deviation. This
more complex approach would be justified if the standard deviations of
the experimental and control groups differed substantially. This
appears not to have been the case in the present set of studies. In no
instance (on the pretest or the posttest) were there significant
differences between the mean standard deviations of the segregated and
the desegregated groups. Thus, it seemed reasonable to employ the
simpler formula advocated by Glass.

In this set of studies the duration of desegregation varies
considerably. In order to obtain an index of the effects of
desegregation during the first year of desegregation I first divided the
effect size (E) by the duration (D) of the study to yield an effect size
per month. In calculating the duration of the study I used the total
number of months the study covered and subtracted 3 months for each
summer vacation period that was included. Thus, the duration measure
reflects only the number of months the students actually spent in
school. Next, I multiplied effect size per month by 8 to obtain an
index of the effect size per year.

\[
\frac{E}{D} \times 8 = \text{effect size per year}
\]

The primary value of this index of effect size is that it avoids
including together in subsequent analyses studies that vary in duration
from 4 to 36 months. These scores were calculated separately for verbal
and math achievement to determine if desegregation had differential
effects on the two basic areas covered by achievement tests. Since some
studies included more than one grade level, I calculated effect sizes
for each grade and for each study as a whole so that comparisons could
be made using grade or study as the unit of analysis. The effect sizes
for grade are presented in Table 1.
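
The duration correction can be sketched as follows. The function names
and the example values are hypothetical; only the rule itself (subtract
3 months per summer, divide by months in school, multiply by 8) comes
from the text.

```python
def months_in_school(total_months, summers_included):
    """Calendar months covered by the study minus 3 months per summer vacation."""
    return total_months - 3 * summers_included


def effect_size_per_year(effect_size, total_months, summers_included=0,
                         months_per_year=8):
    """Effect size per month in school, scaled to an 8-month school year."""
    duration = months_in_school(total_months, summers_included)
    return effect_size / duration * months_per_year


# Hypothetical study: effect size of .30 over 20 calendar months
# that include one summer vacation (17 months in school).
per_year = effect_size_per_year(0.30, total_months=20, summers_included=1)
print(f"effect size per year: {per_year:.2f}")
```

Because the correction is a simple rescaling, it builds in the
assumption that effects accumulate at a constant rate per month.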

Using this procedure for calculating effect size per year assumes that
desegregation has linear effects over time, at least over the first 3
years of desegregation. This is the easiest and, I believe, the most
defensible assumption to make in dealing with the effects of
desegregation over the first few years of desegregation. There are
other plausible relationships, however. For instance, it might be
predicted that if desegregation had positive effects, most of the
benefits would accrue to the students during the initial year or two of
desegregation, after which little additional benefit would be derived.
Alternatively, desegregation might be expected to have negative effects
on achievement initially because of the negative conditions under which
it so frequently occurs. Later, after adjustments have been made,
desegregation might be predicted to have beneficial effects. The
curvilinear nature of these predictions makes them difficult to apply to
the present studies. In this set of studies the assumption of linearity
appears to be reasonable in the case of math, where the correlation
between the duration of the study and the effect size was marginally
significant (r = .48, p < .10). In the case of reading, the
correlation was not significant (r = -.1?, ns). Krol's (1978) study of
effect sizes for achievement is consistent with the assumption that the
effects over time are linear.

Table 1
Effect Sizes

The manner in which the results of these studies are presented is highly
variable. In some studies the means and standard deviations necessary
to calculate effect sizes using the generic formula are reported, but in
others the effect sizes must be calculated using F tests, t tests,
analyses of difference scores or analyses of covariance. Strictly
speaking none of the latter calculations is precisely comparable to the
generic formula, since the derived standard deviations are calculated
from the overall variance. In cases where only covariance analyses are
available, the effect sizes are almost certainly overestimated. This
means that the average effect sizes across this group of studies are
only approximate estimates.

Using studies as the unit of analysis, the average effect size for the
first year of desegregation (8 months) was .17 for verbal achievement,
while the average effect size for math achievement was .00 (Table 2).
Using the effect size for each grade as the unit of analysis, the
effects are .15 for reading and .00 for math. Dropping the four studies
from the sample set that I excluded has little effect on the results.
Using studies as the unit of analysis, the mean effect size for verbal
achievement including all the studies in the set is .14 and for math it
is .04. These results appear to indicate that verbal achievement
improves somewhat, but math achievement shows little effect as a result
of desegregation. The difference between the mean for reading
achievement and the mean for math achievement is marginally significant
(t = 1.96, p < .08, Table 4).

One way to convey the magnitude of these effect sizes is to consider
what it would mean in terms of a test, such as the SAT or the GRE, that
has a mean of 500 and a standard deviation of 100. The effect for verbal
achievement would translate into a 17 point increase as a consequence of
the first year of desegregation. The math effect would translate into
no improvement. Another more approximate way of thinking about these
figures would be to consider what the effects of desegregation are on
the average percentile ranking of Black students on a standardized test.
If desegregation improves verbal achievement .17 standard deviation
units, this would raise the average percentile rank of Blacks about 5
percentage points during the first year of desegregation. For math
there would be no changes in percentile rank due to desegregation.
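
Both translations can be checked with a short sketch. It assumes
normally distributed scores, and the starting percentiles (50th and
16th) are illustrative assumptions, since the text does not state the
baseline it used; the size of the percentile gain depends on that
baseline.

```python
from statistics import NormalDist


def scaled_gain(effect_size, scale_sd=100.0):
    """Gain in points on a test whose standard deviation is `scale_sd`."""
    return effect_size * scale_sd


def percentile_after_shift(start_percentile, effect_size):
    """Percentile rank after moving `effect_size` SD units up a normal curve."""
    unit_normal = NormalDist()
    z_start = unit_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * unit_normal.cdf(z_start + effect_size)


points = scaled_gain(0.17)   # points gained on a 500/100 scale

# Assumed baselines: the 50th percentile, and the 16th percentile
# (roughly one standard deviation below the mean).
gain_from_50 = percentile_after_shift(50.0, 0.17) - 50.0
gain_from_16 = percentile_after_shift(16.0, 0.17) - 16.0

print(f"{points:.0f} points; percentile gains {gain_from_50:.1f} and {gain_from_16:.1f}")
```

The same effect size yields a smaller percentile gain the further the
starting point lies from the middle of the distribution.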

Why would desegregation affect the reading achievement of Blacks and not
their achievement in math? One possibility is that reading achievement
may be improved by direct exposure to the language usage and vocabulary
of White students and teachers. Learning middle-class vocabulary and
syntax may aid test performance. Such an improvement would not be due
to any changes in the quality of teaching, or changes in expectancies or
achievement motivation, but simply to being able to understand the tests
and the content of the questions better. Similar improvements would not
be expected for math because there is no parallel to this type of
indirectly learned information in the case of math. Here no improvement
Table 2

Means for
Uncorrected Effect Size and
Effect Size Corrected for Duration of Study

Using Classes as the Unit of Analysis

                        Reading     Math
Uncorrected    Mean       .24        .04
               S.D.       .39        .34
Corrected      Mean       .15        .00
               S.D.       .22        .20

Using Studies as the Unit of Analysis

                        Reading     Math
Uncorrected    Mean       .24        .06
               S.D.       .35        .25
Corrected      Mean       .17        .00
               S.D.       .22        .16
Table 4

Reading vs. Math*

                          Reading        Math
                          Effect Size    Effect Size     t      df     p
Uncorrected (Studies)        .21            .06         1.33    13     ns
Corrected (Studies)          .1?            .00         1.96    13    .08
Uncorrected (Classes)        .21            .03         2.27    2?    .0?
Corrected (Classes)          .12            .00         2.52    2?    .02

*The Syracuse study is excluded from this analysis because
it did not include math achievement.
would be expected unless there were changes in the quality of
instruction, or the students' expectancies or achievement motivation
increased.

In this set of studies, the magnitude of the effect sizes is unrelated
to the region in which the studies were done, the size of the cities in
which the studies were done, and the size of the samples (Table 3).
There is a marginally significant negative correlation between the grade
the students were in when they were desegregated and the size of the
effect for reading achievement (r = -.33, p < .10). The relationship
between grade and effect size is not significant for math (r = .22, ns).
For reading this suggests that younger students benefited more than
older students from desegregation. One explanation for this
relationship is that exposure to White students (and in some cases,
White teachers) may benefit students who have had little previous direct
or vicarious contact with Whites. This benefit probably consists of
exposure to the type of vocabulary that achievement tests measure.
Older students who have had more direct and vicarious contact with
Whites may benefit less from exposure to Whites in desegregated schools
because they have had more exposure to White middle-class language usage
and vocabulary.

The correlation between the year the study was done and the size of the
effect for reading is also marginally significant (r = -.49, p < .10,
using studies as the unit of analysis). The correlation between the
year the study was done and math achievement is not significant
(r = -.32). It is not clear why this effect exists for reading. One
possibility is that the early studies tended to be of voluntary
desegregation where only select students participated. These
desegregation programs may have made special efforts to help the
incoming students and these students were probably highly motivated
to succeed. In contrast, students in mandatory desegregation programs
and later voluntary programs may have received less special treatment
and may not have been as motivated to learn. However, the effects of
special treatment would be expected to affect both reading and math, and
there was no relationship for math, although the direction of the
correlation is the same.

It also appears that the effect size for reading was larger in school
districts where the desegregation was voluntary rather than mandatory
(mean = .21 voluntary, mean = -.03 mandatory). While this difference is
statistically significant (t = 3.15, p < .05, using studies as the
unit of analysis and the corrected effect sizes as the dependent
measure), the number of districts in which desegregation was mandatory
is so small (n = 2) that these results may not be reliable. The effect
for math was not significant (t = .23, ns). The most likely explanation
for these effects is that the students who participated in desegregation
voluntarily were more motivated to get to know other students. This
informal contact would have enabled them to acquire verbal skills that
could have affected their test performances, but it would not have
enabled them to acquire math skills that affect test performance.

I would like to argue that none of the relationships regarding effect
size (grade, year, city size, region, or type of desegregation)
should be regarded as conclusive because the effect sizes themselves are
Table 3

Correlations of Corrected Achievement Scores with
Grade, Year, City Size and Sample Size

                       Reading                     Math
               By Classes   By Studies     By Classes   By Studies
Grade            -.33**                       .22
Year
City Size        -.18         -.21           -.18         -.15
Sample Size      -.28         -.38           -.05         -.17

 * p < .05
** p < .10

unreliable. Even the overall effect sizes that were obtained may not be
meaningful. Given the variability in the effect sizes in these studies,
the confidence limits are rather broad. The 95% confidence limits (the
range within which the true population mean is likely to fall, with only
a 5% probability of being mistaken) for verbal achievement are .04 to
.30, and the 95% confidence limits for math achievement are -.09 to
+.09. Thus, in the case of reading achievement we can be reasonably
confident that desegregation has an effect, although it may be very
small indeed. In the case of math, desegregation appears to have no
effects.
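
Confidence limits of this kind are conventionally computed from the
mean, the standard deviation and the number of effect sizes. The sketch
below is an illustration rather than a record of the calculation used
here: the means and standard deviations are the corrected, by-study
values in Table 2, while the number of studies (14) and the matching
critical value of t (2.160 for 13 degrees of freedom) are assumptions.

```python
import math


def confidence_limits(mean, sd, n, t_crit):
    """Two-sided confidence limits for a mean: mean +/- t * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Assumed: 14 studies, so the two-tailed .05 critical value of t
# with 13 degrees of freedom (2.160) is used.
T_CRIT_13_DF = 2.160

verbal_low, verbal_high = confidence_limits(0.17, 0.22, 14, T_CRIT_13_DF)
math_low, math_high = confidence_limits(0.00, 0.16, 14, T_CRIT_13_DF)

print(f"verbal: {verbal_low:.2f} to {verbal_high:.2f}")
print(f"math:   {math_low:.2f} to {math_high:.2f}")
```

The width of the interval shrinks only with the square root of the
number of studies, which is why a small study set leaves the limits
broad.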

There are other reasons why the average effect sizes should be regarded
with more than a little caution. In those studies involving multiple
grades it is possible to examine fluctuations in the standard deviations
of the students' achievement scores. For instance, in Rentsch's study
the range in standard deviations for the verbal scores is 9.57 to 13.14,
and the range for math scores is 6.52 to 13.37. Obviously, when these
standard deviations are used to calculate effect sizes (using the
generic formula) the magnitude of the effect size will depend on the
standard deviation that is used. If the standard deviations are
unstable, then the effect sizes will be correspondingly unstable. The
lack of stability in standard deviations tends to be a problem with the
studies where the sample sizes are small.

One reason that the studies with small samples have variable standard
deviations consists of sampling problems (e.g., non-random sampling).
Fluctuations in standard deviations within studies may also occur as a
consequence of variable conditions during test administration. Anyone
who has given tests to elementary students is aware of how difficult it
is to maintain standardized procedures. Large sample sizes compensate
somewhat for this variability in testing conditions, but most of the
studies reviewed here did not use large samples.

Even if the standard deviations were stable, the small sample sizes of
many of these studies would result in means that may not be accurate.
In order to be accurate to within .5 standard deviation units of the
true population mean, a sample size of 15 is required. To be accurate to
within .1 standard deviation units requires a sample of 384. Thus, the
mean values reported in the studies with small sample sizes are not
likely to be measured accurately enough to provide reliable effect
sizes. If there were a sufficient number of these samples, the errors
of measurement would cancel each other out, but the number of samples is
not large enough in this set of studies to lead to confidence in the
summary figures concerning effect sizes. Also, the substantial
variability in effect sizes suggests that the mean effect size may be
distorted by extreme scores, and indeed the effect size for verbal
achievement is lowered to .13 if the median is used as a measure of
central tendency rather than the mean. If the effect sizes were
corrected for the unreliability of the achievement tests this would also
lower the estimate of the verbal achievement effect size.
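
The sample sizes of 15 and 384 are what the usual normal-approximation
formula gives at the 95% level. The sketch below assumes that this
formula, n = (z / margin)^2 with the margin expressed in standard
deviation units, is the one intended; the function name is hypothetical.

```python
from statistics import NormalDist


def required_sample_size(margin_in_sd_units, confidence=0.95):
    """Sample size needed for the sample mean to fall within the given
    margin (in SD units) of the population mean at the given confidence."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return (z / margin_in_sd_units) ** 2


n_half_sd = required_sample_size(0.5)
n_tenth_sd = required_sample_size(0.1)

print(round(n_half_sd), round(n_tenth_sd))
```

Cutting the margin of error to one-fifth multiplies the required sample
by twenty-five, which is why few of the studies reviewed here can
support precise estimates.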
Another reason that the average effect sizes should be viewed with
caution concerns methodological problems with the studies. While these
studies were chosen because they are the best ones available, they are
not without their defects. The list of potential defects is a long one.
Threats to internal validity include those already mentioned: small
sample sizes, non-random samples, and fluctuations in standard
deviations (suggesting unreliability of measures). In addition, the
quality of the measures of achievement varies (some use measures
developed within the district, others use tests standardized on White
populations), attrition varies considerably across studies and threatens
the validity of studies where it is high, and the segregated control
groups are often of uncertain comparability to the desegregated groups.

Threats to external validity are comprised primarily of concerns with
the non-representativeness of these samples of Black students and of
this group of studies. Only students who are in desegregated schools at
the end of the study are included in the posttest and often in the
pretest means. Usually students who stay in the program are not compared
to those who drop out to determine if they are different. Thus, we
cannot be confident that the samples of desegregated students in these
studies are representative of Black students generally. Also, the
studies are mostly of voluntary desegregation in medium to large
northern cities. The degree to which it is appropriate to generalize
these results to mandatory desegregation in other regions of the country
or to small cities and rural areas is unclear.
Glass, in discussing meta-analyses as a research method, suggests
that "Respect for parsimony and good sense demands an acceptance of the
notion that imperfect studies can converge on a true conclusion"
(p. 356). His argument relies on an example in which a set of studies
are "similar in that they show a superiority of the experimental over
the control group" (p. 356). However, this argument may not apply as
forcefully to a set of studies, such as those on the effects of
desegregation on Black achievement, in which the results are variable
rather than similar. Under these circumstances, the variability in
results may be interpreted in terms of methodological problems as
parsimoniously as in terms of more substantive causes.

A Basic Problem In Evaluating Desegregation 


Perhaps the most fundamental oversight of the social scientists involved
in the Brown trial was in not giving due consideration to the manner in
which segregation would be eliminated. They were not alone
in this oversight; even the lawyers for the NAACP did not consider this
problem in detail until after the first Brown decision in 1954. The
Justices of the Supreme Court were vague in their recommendations, saying
in the second Brown decision in 1955 only that segregation should be
ended with "all deliberate speed" (Kluger, 1976, pp. 714-747). When
desegregation began to be implemented 10 years after Brown, the forms it
took were as varied as the communities in which it took place. I
believe it is this complexity more than any other factor that accounts
for the diverse results that have been observed in studies of the
effects of desegregation on achievement. The diversity of desegregation
programs is so great as to render the word without a precise meaning.
Let me be specific about this complexity, although it is familiar to
anyone who has studied the problem. Each community starts with its own
unique history of relations between the races, including when Blacks and
Whites settled there, the origins of members of these groups, the social
class structure of the groups, the degree of residential segregation and
so on. The communities vary along such potentially important dimensions
as size, region of the country, ratio of majority-to-minority group
members, presence of suburbs and private schools to which Whites may
flee, and funding for public schools. The desegregation programs
implemented in these communities have their own unique history of
litigation and decision making by school boards and other public
officials. The programs themselves vary in the techniques used to
create desegregation: some programs are voluntary but most are not, and
the plans may involve voluntary cross-district busing, pairing, the use
of magnet schools, the closing of some (usually Black) schools, and the
mandated busing of students (usually Black students). The desegregation
of teachers may or may not accompany the desegregation of students and
the amount of preparation teachers are given for desegregation is
variable. Additional curricular changes may occur at the same time as
desegregation, the age of the students included in desegregation plans
varies, the speed with which a plan is implemented varies, community
opposition varies as does the amount of White flight, and the ratio of
majority-to-minority students differs from community to community, as do
the social class backgrounds of the students and the quality of their
predesegregation educational experiences. As long as this list seems,
it is surely incomplete. What these differences mean is that comparing
the effects of desegregation across communities is extraordinarily
difficult. It is possible to use quantitative measures to examine the
effects of some of the factors in this list, but the majority are more
difficult to study and compare.

The Effects of Desegregation on Self Esteem and Race Relations

The social scientists who participated in the Brown trials believed that
segregation has negative effects on the self esteem of Black students
and on relations between the races, as well as having negative effects
on achievement. One of the clearest presentations of their views comes
from the statement that 35 social scientists filed as an Amicus Curiae
brief in the Brown trial.

    "Segregation, prejudices and discriminations, and their
    social concomitants potentially damage the personality of
    all children ... Minority group children learn the inferior
    status to which they are assigned ... they often react with
    feelings of inferiority and a sense of personal humiliation
    ... Under these conditions, the minority group child is
    thrown into a conflict with regard to his feelings about
    himself and his group. He wonders whether his group and he
    himself are worthy of no more respect than they receive.
    This conflict and confusion leads to self-hatred ...

    Some children, usually of the lower socio-economic
    classes, may react by overt aggressions and hostility
    directed toward their own group or members of the dominant
    group." (Allport et al., pp. 429-430)
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The social science brief and testiDony in the individual trials leading 
up to Brown indicate that it was anticipated that ending segregation 
would remove the stigma of inferiority that was forced on Black 
children. 

Self esteem. The effects of desegregation on self esteem appear to be
less favorable than the effects of desegregation on achievement. In my
earlier review (Stephan, 1978), I found that desegregation led to
decreases in the self esteem of Black students in 5 of 20 studies and
that there were no studies indicating that desegregation increased Black
self esteem. As was true for the studies of the effects of
desegregation on achievement, the majority of these studies have been
concerned with the effects of desegregation over a period of 1 year or
less. One study that examined the effects over a longer period of time
found that Black self esteem rebounded to
predesegregation levels during the second year (Gerard and Miller, 1975).
Subsequent studies of Black self esteem, including my own (Stephan and
Rosenfield, 1978), have not changed this picture much. My conclusions
regarding the effects of desegregation on the self esteem of Black
students are consistent with those of other investigators (e.g., Banks,
1976; Epps, 1975; Gordon, 1977; Shuey, 1966).

It appears that the social scientists who participated in Brown used an
invalid assumption as a basis for their argument that desegregation
would increase the self esteem of Black students. Undoubtedly
segregation stigmatizes Black students, but this stigma is not reflected
in the self esteem of Black students. Studies of segregated Blacks and
Whites show that Black students have self esteem levels that are similar
to or higher than White students in more cases than they have lower self
esteem (see Porter and Washington, 1979, and Stephan and Rosenfield,
1979, for reviews). These studies have employed questionnaire measures
of self esteem rather than indirect measures such as the doll tests upon
which the social scientists' statements in Brown were based. The
indirect measures may have been tapping attitudes toward Blacks and
Whites as ethnic groups. There is considerable evidence indicating that
young Black children have less favorable attitudes toward Blacks than
toward Whites (Williams and Morland, 1976).

If segregated Black students do not have low self esteem, there is
little reason to expect that desegregation would increase self esteem.
In fact, there are several compelling reasons why decreases in self
esteem might be expected. For instance, social comparison with White
students who are academically better prepared than Blacks could lead
Blacks to evaluate themselves negatively. Likewise, the loss of status
and power that occurs when Blacks represent a minority of the student
body in desegregated schools could also lower the self esteem of Black
students. In addition, negative evaluations by ethnocentric White
students could adversely affect the self esteem of Blacks.

Attitudes. The social scientists in their brief were also hopeful that
contact within the schools would improve intergroup relations.

    "Under certain circumstances desegregation ... has been
    observed to lead to the emergence of more favorable
    attitudes and friendlier relations between races. There is
    less likelihood of unfriendly relations when change is
    simultaneously introduced into all units of a social
    institution ... and when there is consistent and firm
    enforcement of the new policy by those in authority.
    ... These conditions can generally be satisfied in ... public
    schools." (pp. 437-438)

The social scientists appreciated the fact that contact alone would not
be sufficient to improve intergroup relations. Their statement notes
several preconditions for favorable change: equal status between the
groups, and firm, thorough implementation of desegregation. It is
likely that they were aware of other relevant factors such as those
mentioned by Williams (1947) a half dozen years before the social
science statement was drafted:

    "Lessened hostility will result from arranging intergroup
    collaboration, on the basis of personal association of
    individuals as functional equals on a common task jointly
    accepted as worthwhile." (Williams, 1947)

The data on the initial effects of desegregation on race relations
suggest that the social scientists' caution was well founded. In an
earlier review of the data, I found that desegregation increased Black
prejudice toward Whites in almost as many cases as it decreased
prejudice (Stephan, 1978). The results for Whites were somewhat more
negative. Recent studies, including my own, which also indicated that
desegregation does not improve race relations (Stephan and Rosenfield,
1978), have not led me to revise these conclusions (e.g., Bullock, 1976;
Campbell, 1977; Patchen, 1982; Sheehan, 1980). The quality of these
studies is not as high as the better achievement studies, and there is
such a small number of them that these conclusions can only be regarded
as tentative. My conclusions are, however, generally consistent with
those of other investigators (Armor, 1972; Epps, 1975; St. John, 1975;
Schofield, 1978; Weinberg, 1970).

In the years since Brown the contact hypothesis has been elaborated and
refined. These elaborations are helpful in understanding why
desegregation often has not had a positive effect on race relations.
Here are my own most recent statements concerning the conditions under
which contact improves intergroup relations.

1. Cooperation within groups should be maximized and
competition between groups should be minimized.

2. Members of ingroup and outgroup should be of equal
status both within and outside of the contact situation.

3. Similarity of group members on non-status dimensions
appears to be desirable (beliefs, values, etc.).

4. Differences in competence should be avoided.

5. The outcomes should be positive.

6. There should be strong normative and institutional
support for the contact.

7. The intergroup contact should have the potential
to extend beyond the immediate situation.

8. Individuation of group members should be promoted.

9. Non-superficial contact (e.g., mutual disclosure
of information) should be encouraged.

10. The contact should be voluntary.

11. Positive effects are likely to correlate with the
duration of the contact.

12. The contact should occur in a variety of contexts with
a variety of ingroup and outgroup members.

13. There should be equal numbers of ingroup and outgroup
members. (Stephan, 1983)



Desegregation rarely occurs under conditions that would lead to
improvements in race relations. Instead, desegregation often occurs
after there has been considerable community opposition from parents,
administrators, school boards, and teachers. Thus, institutional and
normative support for the contact is frequently low; the atmosphere
tends to be competitive rather than emphasizing cooperation in pursuit
of common goals; the statuses of Blacks and Whites often are unequal
both outside the school (due to social class) and within the school (due
to unbalanced ratios of Blacks and Whites); the Black students are often
not as well prepared academically as the Whites, so
stereotype-confirming differences in academic competencies frequently
occur; busing often limits out-of-school contact, and the within-school
contact that does occur is more likely to be negative or neutral than
positive, and in most cases it will be superficial. Also, the contact
is involuntary in the case of court-ordered desegregation.

Recent research on the use of cooperative interethnic groups in
desegregated schools indicates that when the conditions specified above
are met, intergroup relations and self esteem improve without any costs
in terms of lowered achievement (e.g., Aronson, Stephan, Sikes, Blaney
and Snapp, 1978; Cohen, 1980; Cooper, Johnson, Johnson and Wilderson,
1980; De Vries, Edwards and Slavin, 1978; Weigel, Wiser and Cook, 1975).
Other intergroup relations techniques involving multiethnic curricula,
discussions of race issues, and explicitly providing information about
the cultures of different groups have also been found to improve
intergroup relations in the majority of cases (see Stephan, 1983, and
Stephan and Stephan, 1983, for reviews). What these studies demonstrate
is that while simply mixing students of different groups in desegregated
schools does not improve race relations, intergroup relations can be
improved in desegregated schools by introducing special programs
designed to achieve this goal.

Future Directions for Research in Desegregation



I would like to see research into techniques to improve achievement,
race relations, and self esteem continue. In addition, there are
several other areas where I think research should also be done. One of
the major problems with nearly all desegregation research is that it
only covers the effects of the first year of desegregation, or at most
the first two or three years of desegregation. There are almost no
studies of the long-term effects of desegregation. We need to know not
only what the long-term educational effects of desegregation are, but we
also need to know what the non-educational effects are. And we need to
know the effects not only for Whites and Blacks, but also for other
ethnic groups as well. Does school desegregation reduce segregation in
other realms, such as housing; do minority students who have attended
desegregated schools get better jobs and do they get promoted at a
faster rate than students who attended segregated schools; and is
subsequent political participation increased as a result of attending
desegregated schools?

Also, we need to know more about the effects of desegregation on the
communities that have undergone it. For instance, how do people in
communities with well-established desegregation programs feel about
desegregation now; are people who have attended desegregated schools
more willing to send their children to desegregated schools than people
who attended segregated schools; and what differences are there in the
race relations of communities with well-established desegregation
programs compared to other communities?

A third set of questions concerns the factors associated with successful
desegregation programs. When desegregation goes well, why does it work?
One can imagine a wide variety of factors that could be relevant, some
having to do with the community in which it takes place, others having
to do with the way administrators and teachers respond to desegregation,
and still others with the composition of the student body. The fact is
that we know precious little about what differentiates successful from
unsuccessful desegregation programs.

Desegregation in Perspective 

It would be impossible to present a comprehensive evaluation of the
effects of desegregation in this short article. Instead, I have
attempted to confine myself to some of the effects of desegregation on
students. However, the larger context in which desegregation occurs is
of immense importance to an understanding of the meaning of
desegregation.

In order to put desegregation in perspective, we must consider the role
that it has played in influencing relations between the races in our
society. Since 1954, vast changes in race relations have occurred; many
overt forms of discrimination have been eliminated, levels of prejudice
have decreased, most minority groups have made economic advances,
political participation by minority group members has increased
dramatically, and more minority group members are attending college.
School desegregation has played a role in these economic, political, and
social changes, but it is a role that is not well understood and is
little studied. Any analysis that abstracts school desegregation from
its social context is necessarily incomplete. Unfortunately, we are not
now in a position to perform such an analysis. Given the difficulty of
answering even a limited question like the effects of desegregation on
Black achievement, it doesn't seem likely to me that we will be in a
position to do an adequate comprehensive evaluation of desegregation
anytime in the near future.

As we acquire more information on the outcomes of desegregation, we will
be in a better position to base policy decisions on data. However, for
the present, it seems to me that we will have to continue to make major
policy decisions about desegregation on the basis of competing values.
Some of these values concern the goals of public education, in
particular the degree to which the schools should concern themselves
with intergroup relations and the preparation of students to participate
in a pluralistic society. Other decisions that we will continue to have
to make pit the importance of creating equal educational opportunities
against freedom of choice and freedom of association. Perhaps most
importantly, we will have to decide whether we value the elimination of
segregation enough to continue the 50-year battle against it. Social
science may be of less value in making these choices than in
making choices about the best ways of implementing these decisions.

The purpose of the present paper is to analyze research on the impact of
school desegregation on academic achievement. More specifically, the
particular emphasis of this paper is the comparison of the effects of
desegregation with those of other factors in the process of school
learning that have been recently synthesized.

The paper is divided into three sections. The remainder of this first
section discusses techniques and guidelines for research synthesis
including meta-analysis. The second section presents a summary of the
statistical analyses of research reviews of the 1970's and a collection
of meta-analyses of the 1980's, which reveal the consistently potent
productivity factors in school learning and which further illustrate
techniques and guidelines for research synthesis. The third section
assesses selection criteria for studies of school desegregation and
achievement, and compares the effects of desegregation (as revealed by
three recent meta-analyses) with the effects of the
educational-productivity factors.

Research Synthesis

The present is an extraordinary time in the history of education because
research syntheses are demonstrating the consistency of educational
effects and are helping to put teaching and other determinants of
learning on a sound scientific basis. Research synthesis is an attempt
to apply scientific techniques and standards explicitly to the
evaluation and summarization of research; it not only statistically
summarizes effects across studies but also provides detailed, replicable
rationales and descriptions of literature searches, selection of
studies, metrics of study effects, statistical procedures, and overall
results as well as those that call for exception with respect to context
or subjects by objective statistical criteria (Glass, 1977; Cooper &
Rosenthal, 1980; Jackson, 1980; Walberg & Haertel, 1980; Glass, McGaw, &
Smith, 1981; and Light & Pillemer, 1982). Qualitative insights may be
usefully combined with quantitative synthesis (Light & Pillemer, 1982);
and quantitative results from multiple reviews and syntheses of the same
or different topics may be compiled and compared to estimate their
relative magnitudes and consistencies (Walberg, 1982).

Research synthesis is not merely statistical analysis of studies.
Jackson (1980) discusses six tasks comprising an integrative review or
research synthesis: specifying the questions or hypotheses for
investigation; selecting or sampling the studies for synthesis; coding
or representing the characteristics of the primary studies; analyzing,
or meta-analyzing (Glass, 1977) or statistically synthesizing the study
effects; interpreting the results; and reporting the findings.
Although these tasks seem obviously necessary to encourage replication
of reviews, Jackson found only 12 out of 87 recent reviews in prominent
educational, psychological, and sociological journals that provided even
a cursory statement of methods. The basic idea behind much good advice
in Jackson's paper is that the methods of review and synthesis should be
explicit to enable other investigators to attempt to replicate the
synthesis.

Explicit methods concerning quantitative synthesis, however, inevitably
call for statistics, and two are most often employed: the vote count or
box score, and the effect size (Glass, 1977). The vote count is easiest
to calculate and explain to those who are unaccustomed to thinking
statistically; it is simply the number or percentage of all studies that
are positive, for example, in which the experimental exceed control
groups or the independent variable correlated positively with the
dependent variable.
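
As a minimal sketch of the vote count just described, the following
Python fragment tallies the number and percentage of positive results.
The function name and the eight study values are hypothetical and are
given only for illustration; they are not drawn from any synthesis
discussed in this paper.

```python
def vote_count(effects):
    """Return (number of positive results, percent positive).

    `effects` holds one signed result per study, for example the
    experimental-minus-control difference or a correlation.
    """
    positives = sum(1 for e in effects if e > 0)
    return positives, 100.0 * positives / len(effects)


# Hypothetical signed results from eight studies (illustrative only).
study_effects = [0.30, 0.12, -0.05, 0.41, 0.08, -0.20, 0.15, 0.02]

n_positive, pct_positive = vote_count(study_effects)
print(n_positive, pct_positive)
```

Note that the tally uses only the sign of each result; it says nothing
about how large the differences are.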

The effect size is the difference between the means of the experimental
and control groups divided by the control group standard deviation; it
measures the average superiority (or inferiority, if negative) of the
experimental relative to the control groups. (For cases in which these
statistics are unreported, Glass (1977) provides a number of alternate
estimation formulas.) If education had uniform ratio variables, such as
time and money as in economics, or physical measures in natural sciences
such as meters and kilograms, effect sizes would be unnecessary; it
could be said, for example, that the experimental groups grew .42
comprehension units in reading history on average, and the control group
grew .22 units, without the crude post hoc standardization for comparability
required in meta-analysis.
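
The definition above can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not a reproduction of any
author's procedure: the function name and scores are hypothetical, and
the control-group standard deviation is taken to be the sample standard
deviation (n - 1 in the denominator), a detail the text does not specify.

```python
from statistics import mean, stdev


def effect_size(experimental, control):
    """Effect size as defined in the text: the experimental mean minus
    the control mean, divided by the control-group standard deviation.

    Assumption: stdev() is the sample standard deviation (n - 1).
    """
    return (mean(experimental) - mean(control)) / stdev(control)


# Hypothetical test scores (illustrative only).
experimental_scores = [12, 14, 16]
control_scores = [8, 10, 12]

# Means are 14 and 10; the control standard deviation is 2.
print(effect_size(experimental_scores, control_scores))
```

Because the difference is expressed in control-group standard deviation
units, results from tests with different scales can be placed side by
side, which is the rough calibration discussed next.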

Effect sizes permit a rough calibration of comparisons across tests,
contexts, subjects, and other characteristics of studies. The estimates,
however, are affected by the variances in the groups, the reliabilities
of the outcomes, the match of curriculum with outcome measures, and a
host of other factors, whose influences in some cases can be estimated
specifically or generally. Although effect sizes are subject to
distortions, many of which may counterbalance one another, they are the
only means of comparing the size of effects in primary research that
employs various outcome measures on non-uniform groups. They are likely
to be necessary until an advanced theory and science of educational
measurement develops ratio measures that are directly comparable across
studies and populations.

Generalizability 

The generality of the results of the synthesis can be divided into
questions of extrapolation and interpolation: Do the synthesized
results generalize to other populations and conditions, particularly to
those that have not been studied or for whom the results are
unpublished? And, do the results generalize across populations and
conditions for which results are available? Extrapolation may be
invalid beyond published studies because journal editors favor positive,
significant studies. Smith (1980) estimates from several syntheses that
mean effect sizes in unpublished work, mainly doctoral dissertations,
are occasionally larger but average about a third smaller than those in
published studies.

Rosenthal (1980), on the other hand, shows that given the great
statistical significance of collections of published studies, the
probability of null effects being established by unpublished studies is
minimal. Furthermore, both the low reliability of educational measures
and low curricular validity (correspondence of what is taught and what is
tested on outcome measures) diminish the estimates of relations between
educational means and ends. Less than optimal reliability and validity,
which lead to underestimates of effects, probably more than compensate
for publication bias; but more empirical and analytic work is needed on
these factors to determine their general and specific influences on
synthesis results.

Interpolation 

The interpolation problem can be readily solved by additional
calculations. The most obvious questions in quantitative synthesis
concern the overall percentage of positive results and their average
magnitude. But the next questions should concern the consistency and
magnitude of results across student and teacher characteristics,
educational treatments and conditions, subject matters, study outcomes,
and validity factors in the studies. These questions can be answered by
calculating separate results for classifications or
cross-classifications of effects.
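
One way such separate results might be computed is sketched below in
Python. The helper name, the coding scheme, and the four coded studies
are all hypothetical and serve only to illustrate the calculation; no
actual synthesis is represented.

```python
from collections import defaultdict
from statistics import mean


def summarize_by(studies, key):
    """Group study effect sizes by one coded characteristic and report,
    for each group, the study count, mean effect size, and vote count
    (percent of effects that are positive)."""
    groups = defaultdict(list)
    for study in studies:
        groups[study[key]].append(study["es"])
    return {
        level: {
            "n": len(effects),
            "mean_es": mean(effects),
            "pct_positive": 100.0 * sum(e > 0 for e in effects) / len(effects),
        }
        for level, effects in groups.items()
    }


# Hypothetical coded studies (illustrative values only).
studies = [
    {"grade": "elementary", "subject": "reading", "es": 0.25},
    {"grade": "elementary", "subject": "math", "es": 0.5},
    {"grade": "secondary", "subject": "reading", "es": -0.25},
    {"grade": "secondary", "subject": "math", "es": 0.75},
]

by_grade = summarize_by(studies, "grade")
by_subject = summarize_by(studies, "subject")
print(by_grade)
print(by_subject)
```

A cross-classification could be handled the same way by coding each
study with a combined category (for example, grade level and subject
together) before grouping.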

The results may be compared by objective statistical tests (such as
t, F, and regression weights in general linear models). They permit
conclusions on such matters as the overall effectiveness of treatments
as well as their differential effectiveness on categories of students in
various conditions and different outcomes. Notwithstanding the frequent
claims by reviewers for differential effects on the basis of results of
a few selected studies, most research syntheses yield results that are
robust and roughly consistent across such categories. Such robustness
is scientifically valuable because it indicates parsimonious, law-like
findings; it is also educationally valuable because educators can apply
robust findings more confidently and efficiently rather than using
complicated, expensive procedures tailor-made on unproven assumptions
to special cases.



A number of useful methodological writings are available. Glass (1977)
provides a concise introduction to statistical methods; and Glass,
McGaw, and Smith's (1981) book presents a comprehensive treatment.
Jackson (1980) and Cooper (1982) discuss tasks and criteria for
integrative reviews and research syntheses. Light and Pillemer (1982)
describe methods for combining quantitative and qualitative methods.
Walberg and Haertel (1980) present a collection of eight methodological
papers by Cahen, Cooper, Hedges, Light, Rosenthal, Smith and others and
thirty-five substantive papers mostly on educational topics. In
forthcoming work, Larry Hedges of the University of Chicago and Barry
McGaw of Murdoch University (Australia) offer firmer statistical
and psychometric footings for quantitative synthesis. Important
guidelines for research synthesis that may be found in these works are
further discussed and illustrated in the remaining sections.

Educational Productivity Factors

A Review of Reviews of Teaching Effects

The year 1980 marked a transitional period when investigators recognized
the shortcomings of the traditional review and the advantages of more
objective, explicit procedures for evaluating and summarizing research.
Yet reviews still have a place, and much can be learned from them.
Waxman and Walberg (1982) examined 19 reviews of teaching
process-student outcome research published during a recent decade that
critically reviewed at least three studies and two teaching constructs;
they described their methods, compared their conclusions, synthesized
them, and pointed out the implications for future reviews, syntheses,
and primary research.

The 19 reviews reflect the inexplicit, varied, and vague standards
revealed by Jackson's (1980) analysis of 87 review articles in prominent
educational, psychological, and sociological journals. None of the
reviews, for example, described their search procedures, and only one
stated explicit criteria for inclusion and exclusion of primary studies.
Comparative analysis of the studies, moreover, revealed that the
reviewers failed to search diligently enough for primary studies or to
state the reasons for excluding large parts of the research evidence.
Among the five reviews that covered positive reinforcement such as
praise and feedback in teaching, only six studies were covered in the
most comprehensive review in contrast to the 39 listed in Lysakowski and
Walberg's (1981) synthesis. Such arbitrary selection of small parts of
the evidence, of course, leaves the reviews open to systematic bias and
means that the reviews and their conclusions cannot be replicated in a
strict sense because their methods are undescribed.

Although the reviews purported to be critical, their coverage of the 33
standard threats to methodological validity (Cook & Campbell, 1979) was
spotty and haphazard. In 95.4 percent of the possible instances, the
reviews ignored specific threats. External validity (interaction of
teaching treatments with selection, setting, and history) was relatively
well covered, perhaps reflecting the search and claims for
aptitude-treatment interactions of the 1970's; but the serious problems
of internal validity, such as reverse and exogenous causes in
correlational studies, were almost wholly ignored. Indeed, there
appeared an odd tendency to select correlational studies rather than
experiments for review.

Despite these problems, however, a statistical tabulation of the
conclusions of the reviews shows substantial and
statistically-significant agreement that five broad teaching
constructs (cognitive cues, motivational incentives, engagement,
reinforcement, and management and climate) are positively associated with
student learning outcomes (see Table 1). These tabulations, moreover,
are in close agreement with quantitative syntheses of large, systematic
collections of primary studies discussed in a subsequent section.

Current Research Syntheses 

To characterize quantitative syntheses of educational research completed
since 1979, sixteen were found in 1982 by scanning publications of the
American Educational Research Association and writing to the members of
"the invisible college" of about 100 scholars that meet annually to
present and discuss research on teaching. A more systematic search in
late 1982 using Dissertation Abstracts, Social Science Citation Index,
Education Index, computer retrieval, and references in recent
publications indicates that these syntheses plus those discussed in
subsequent sections of this chapter represent about three-fourths of
those completed in education thus far in the 1980s. (An analysis of a
more complete corpus is underway by the present author and colleagues,
but the increasing number of syntheses makes exhaustive coverage an
elusive goal.)

Table 2 suggests a number of instructive points for both educational
practice and research synthesis. It provides, for example, an
empirical answer to the coincidence of vote counts and effect sizes.
Every mean effect size that was positive also had a vote count greater
than 50 percent; every negative effect size had a vote count less than
50 percent. Thus, as may be expected from normal distributions,
consistently positive findings will yield positive average results (the
next section shows that much of the variance in effects can be
predicted by regression from counts). The likely explanation for the
uniform association is that strong causes produce results consistent in
sign. Indeed, the only cases in which the association can be reversed
are skewed distributions in which a few very strong positive results are
sufficient to pull the mean above zero from a cluster of small effects,
more than half of which are negative (or vice versa).
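
The exceptional case described above can be made concrete with a small
Python sketch. The seven effect sizes are hypothetical, constructed only
to show how one very strong positive result can pull the mean above zero
while the vote count stays below 50 percent.

```python
from statistics import mean

# Hypothetical skewed set of effect sizes (illustrative only): a cluster
# of small effects, more than half of them negative, plus one very
# strong positive result.
skewed_effects = [-0.05, -0.04, -0.03, -0.02, 0.01, 0.02, 1.50]

pct_positive = 100.0 * sum(e > 0 for e in skewed_effects) / len(skewed_effects)
mean_effect = mean(skewed_effects)

# The vote count is below 50 percent, yet the mean effect is positive.
print(round(pct_positive, 1), round(mean_effect, 3))
```

With roughly symmetric distributions of effects this disagreement does
not arise, which is consistent with the uniform association reported
for Table 2.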

The first two syntheses grouped under Teaching Strategies in Table 2
show fairly close agreement with respect to the consistency of
cooperative learning. Johnson and others (1981) categorized their
results by comparisons of four treatment variations (cooperative,
competitive, group competitive, and individualistic), whereas Slavin
(1980) categorized his results by outcomes. Cooperative learning
obviously produces superior results; but it would be useful if journal
editors would allow research synthesists space to report average results
by more standard classifications of independent and dependent variables
and study conditions to facilitate comparisons of replicated syntheses
such as these two.

The next two syntheses raise important, unresolved methodological
questions. Becker and Gersten's (1982) synthesis indicated a small
average effect of direct instruction in several sites, but all effect
sizes came from the same study. Although teachers in the various sites
may have been independent actors, methodological bias can make the
effects non-independent from a statistical point of view, and
independent replications by different investigators would be in order to
provide a more definitive answer. Pflaum and others (1980) found no
average superiority of different reading methods but a substantial
advantage in learning outcomes of experimental over control groups no
matter what the reading method employed. Although Hawthorne effects
could be discounted by the synthesis, the increased energy and attention
devoted to tasks by teachers in experimental groups rather than putative
treatments themselves may partly account for superior results in
teaching-methods and other educational studies.

Table 2 includes two rough replications that indicate substantial
agreement in results despite large variations in study search,
selection, and numbers. Hansford and Hattie's (1982) and Findley and
Cooper's (1981) syntheses of correlations of self-concept and locus of
control with achievement and performance differ only slightly in the
second decimal place in both the vote counts and average correlations.
Carlberg and Kavale's (1980) and Ottenbacher and Cooper's (1981)
syntheses agree that the effects of mainstreaming (federally-encouraged
efforts in the United States to mix regular and cognitively, emotionally,
and physically handicapped children in the same classes) are
inconsistent and probably near zero.

Two syntheses show curvilinear effects of independent variables on
educational outcomes. Smith and Glass (1980) found that the benefits of
reduced class size are larger at the smaller ranges of one to 10 members
than they are at higher ranges; for example, the measurable cognitive
and affective outcome differences between classes of 20 and 60 appear
trivial. Similarly, Williams and others (1982) found decreasing
achievement with departures from 10 weekly hours of leisure-time
television viewing such that estimated differences in achievement between
children who watch about 30 hours (an average number) and 60 (a large
amount) are minuscule.

Other effects are summarized in the table, and the reader is referred to 
the original syntheses for details that are not discussed here. 
Overall, the results indicate a large range of effects, which, if 
replicated in further primary research and syntheses, would have fairly 
definite implications for choosing policies and practices that seem 
likely to have consequential effects on raising educational outcomes. 

The Michigan Program 

Chen-Lin and James Kulik lead a vigorous group of research synthesists 
at the University of Michigan, which included Peter Cohen, now of 
Dartmouth. The group has been unusually productive of high-quality 
syntheses, first in higher education and later in secondary-school 
research. Personal communications with the group reveal that their team 
approach, much like that described by Shulman and Tamir (1973) in the 
Second Handbook of Research on Teaching, accounts in part for the 
quantity and quality of work. 

James Kulik kindly prepared Table 3 according to the present author's 
specifications. It shows the results of eleven syntheses completed by 
the Michigan group by the end of 1981. Like the sixteen syntheses by 
other investigators discussed in the last section, those in Table 3 show 
a number of consistent moderate to large effects that can help to put 
high school and college teaching on a firm scientific basis. 

Kulik's results also permit an estimate of the mean size of effects from 
vote counts. The regression equation, ES = -.403 + .008 (% Positive), 
accounts for 76 percent of the variance in the effect sizes. The 
corresponding equation for the syntheses in Table 2 for which both 
indexes are available, ES = -.761 + .015 (% Positive), accounts for 59 
percent of the effect-size variance (the correlational results assume 
both causality and a one-unit increase in the independent variable). 
Both equations forecast near-zero effect sizes for vote counts of 50 
percent; but the higher slope for the results in Table 2 forecasts larger 
effects than do the Michigan data; at vote counts of 75 percent, for 
example, the respective forecasts are .36 and .20. Thus the size of the 
regression slope is unstable across samples, and more intensive analyses 
of the complete corpus of syntheses are in order. 
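
The forecasts quoted above can be checked with a few lines of 
arithmetic. The following is only a sketch: the function and constant 
names are illustrative and not from the original, and the coefficients 
are those read from the two regression equations in the text. 

```python
# Forecast mean effect size (ES) from a vote count (% positive studies)
# using the two regression equations quoted in the text.

def forecast_es(intercept, slope, pct_positive):
    """Linear forecast: ES = intercept + slope * (% positive)."""
    return intercept + slope * pct_positive

# (intercept, slope) pairs; the names are illustrative assumptions.
MICHIGAN = (-0.403, 0.008)  # Michigan syntheses
TABLE_2 = (-0.761, 0.015)   # syntheses by other investigators (Table 2)

for pct in (50, 75):
    es_table_2 = forecast_es(*TABLE_2, pct)
    es_michigan = forecast_es(*MICHIGAN, pct)
    print(f"{pct}% positive: Table 2 ES = {es_table_2:.2f}, "
          f"Michigan ES = {es_michigan:.2f}")
```

At 75 percent the two lines give roughly .36 and .20, and at 50 percent 
both are close to zero, which agrees with the figures in the paragraph. 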

The two data sets also permit separate empirical estimates of the 
distributions of vote counts and effects. The means (and standard 
deviations) of Michigan and other estimates of the vote counts are 
respectively 67 and 64 (and 19 and 16); the mean effects are 
respectively .17 and .22 (and .19 and .31). Assuming normal 
distributions of effects, empirical norms for vote counts and effect 
sizes can be set forth on the basis of the averages of these statistics; 
for example, the middle two-thirds of the effects in the recent 
educational research sampled range from about -.05 to .45. It could be 
said that effect sizes of .20 are average, and those above .45 are large 
and exceed about 84 percent of those typically found in educational 
research. Similarly, vote counts of 67 and 85 percent might be 
provisionally taken as average and large. These norms are, of course, 
very rough and preliminary, but they are based on empirical results 
rather than opinion and may be useful in gauging present and future 
results until larger normative samples are analyzed. 
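
The effect-size norms in this paragraph can be reproduced 
approximately as follows. This is a sketch under one assumption not 
stated in the text: that the "averages of these statistics" are simple 
unweighted means of the two reported means and of the two reported 
standard deviations. Variable names are illustrative. 

```python
from statistics import NormalDist, mean

# Means and standard deviations of effect sizes reported in the text
# for the Michigan syntheses and for the other syntheses.
effect_means = [0.17, 0.22]
effect_sds = [0.19, 0.31]

# Assumption: unweighted averages of the two sets of statistics.
avg_mean = mean(effect_means)  # about .20
avg_sd = mean(effect_sds)      # .25

# One standard deviation either side of the average effect.
low, high = avg_mean - avg_sd, avg_mean + avg_sd

std_normal = NormalDist()
# Share of a normal distribution within one SD of its mean.
share_middle = std_normal.cdf(1.0) - std_normal.cdf(-1.0)
# Share of a normal distribution below one SD above its mean.
share_below_high = std_normal.cdf(1.0)

print(f"average effect = {avg_mean:.3f}, SD = {avg_sd:.2f}")
print(f"middle band: {low:.3f} to {high:.3f}")
print(f"share inside band = {share_middle:.2f}")
print(f"share below upper bound = {share_below_high:.2f}")
```

Under that assumption the band runs from about -.05 to .45, holds about 
two-thirds of a normal distribution, and its upper bound exceeds about 
84 percent of effects, in line with the norms stated above. 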

Syntheses of Bivariate Productivity Studies 

A group at the University of Illinois at Chicago has concentrated on 
synthesizing research on nine theoretical constructs that appear to have 
consistent causal influences on academic learning: student age or 
development level, ability (including prior achievement), and 
motivation; amount and quality of instruction; the psychological 
environments of the class, home, and peer group outside school; and 
exposure to the mass media (Walberg, 1981). The group first collected 
available vote counts and effect sizes in the review literature of the 
1970's and then conducted more systematic syntheses directly on the nine 
factors. This section summarizes both efforts. 

Synthesis of reviews of the 1970's. Walberg, Schiller, and Haertel 
(1979) collected reviews published from 1969 to 1979 on the effects of 
instruction and related factors on cognitive, affective, and behavioral 
learning in research conducted in elementary, secondary, and college 
classes and indexed in standard sources. The vote counts for the corpus 
of reviews are shown in Table 4. 

The vote counts should be cautiously interpreted because not only may 
journal editors more often select studies with positive results but also 
reviewers may select positive published studies for summarization. 
Neither editors nor reviewers ordinarily state explicit policies on 
these important points. Subsequent, more systematic syntheses, 
nonetheless, have generally supported traditional reviews; and it would 
be wasteful to ignore the labors of the last decade of effort, even 
though it may only be considered a starting point for subsequent work. 

Notwithstanding the possible double bias in the vote counts (see earlier 
sections on counter-biases), the results in Table 4 are impressive. A 
majority of the variables in the table were positively associated with 
learning; in 48 or 68 percent of the 71 tabulations, 80 percent or more 
of the comparisons or correlations are positive. Although all of the 
variables are candidates for synthesis using systematic search, 
selection, evaluation and summarization procedures, it appears that the 
1970's produced reasonably consistent findings that are likely to be 
confirmed by more comprehensive and explicit methods of the present 
decade. 

Syntheses of Productivity Factors. The Chicago group also carried out 
syntheses of the nine factors using methods discussed in previous 
sections of this chapter. The National Institute of Education supported 
the syntheses of learning research in ordinary classes, grades 
kindergarten through twelve. A separate grant from the National Science 
Foundation on science learning, grades 6 through 12, permitted more 
exhaustive, intensive search for unpublished work and an advisory group 
of science educators and research methodologists as well as a 
semi-independent replication of the results for several of the factors. 
A summary of the findings is shown in Table 5. 

All of the effect sizes (including mean contrasts and correlations) are 
in the expected direction. The mean effects for the two samples of 
studies are similar in magnitude, which suggests generality or 
robustness of effects across more and less intensive methods of 
synthesis. In particular, the syntheses of quality of instruction 
including cues, participation, and reinforcement of about 1.0 and .6 in 
general grades K-12 and in science grades 6-12 support the conclusions 
of the 19 reviews discussed in a previous section (see also Table 1). 
Despite these corroborations of findings, of course, independent 
replications of the syntheses as well as new and probing experimental 
studies are needed. 

Table 5 

Correlations and Effect Sizes of Nine Factors 
in Relation to School Learning 

Factor (Number of Studies): Results and Comments 

Instruction 

  Amount (51): Correlations range from [illegible] to .71 with a 
    median of .40; partial correlations controlling for ability, 
    socioeconomic status, and other variables range from [illegible] 
    to [illegible] with a median of .35. 

  Quality (95): The mean of effect sizes for reinforcement in 39 
    studies is 1.17, suggesting a 38-point percentile advantage over 
    control groups, although girls and students in special schools 
    might be somewhat more benefited; the mean effect size for cues, 
    participation, and corrective feedback in 54 studies is .97, 
    suggesting a 33-point advantage. The mean effect size of similar 
    variables in 18 science studies is [illegible]. 

Social-psychological Environment 

  Educational (12): On 19 outcomes, social-psychological climate 
    variables added from 1 to 54 (median = 20) percent to accountable 
    variance in learning beyond ability and pretests; the signs and 
    magnitudes of the correlations depend on specific scales (see 
    Table 1), level of aggregation (classes and schools higher), 
    nation, and grade level (later grades higher); but not on sample 
    size, subject matter, domain of learning (cognitive, affective, 
    or behavioral), or statistical adjustments for ability and 
    pretests. 

  Home (18): Correlations of achievement, ability, and motivation 
    with home support and stimulation range from [illegible] to .82 
    with a median of .37; multiple correlations range from 
    [illegible] to .81 with a median of .44; studies of boys and 
    girls and middle-class children in contrast to mixed groups show 
    higher correlations (social-class correlations in 100 studies, by 
    contrast, have a median of .25). The median correlation for three 
    studies of home environment and learning in science is 
    [illegible]. 

  Media-TV (23): 274 correlations of leisure-time television viewing 
    and learning ranged from -.56 to .35 with a median of .06, 
    although effects appear increasingly deleterious from 10 to 
    [illegible] hours a week and appear stronger for girls and 
    high-IQ children. 

  Peer group (10): The median correlation of peer group or friend 
    characteristics such as socioeconomic status and educational 
    aspirations with achievement-test scores, course grades, and 
    educational and occupational aspirations is .24; correlations are 
    higher in urban settings and in studies of students who reported 
    aspirations and achievements of friends. The median of two 
    science studies is .24. 

Aptitude 

  Age-development (9): Correlations between Piaget developmental 
    level and school achievement range from .02 to .71 with a median 
    of .35. The mean correlation in science is .40. 

  Ability (10): From 596 correlations with learning, mean verbal 
    intelligence measures are highest (mean = [illegible]) followed 
    by total ability (.71), nonverbal ([illegible]), and quantitative 
    (.60); correlations with achievement test scores ([illegible]) 
    are higher than those with grades (.57). The mean 
    ability-learning correlation in science is [illegible]. 

  Motivation (40): Mean correlation with learning is .34; 
    correlations were higher for older samples and for combinations 
    of subjects (mathematics) and measures, but did not depend on 
    type of motivation nor the sex of the samples. The mean of three 
    studies in science is .93. 

Syntheses of Multivariate Studies 

The Chicago group also conducted multivariate analyses of the 
productivity factors in samples of from two to three thousand 13- and 
17-year-old students who participated in the mathematics, social 
studies, and science parts of the National Assessment of Educational 
Progress (see, for example, Walberg, Pascarella, Haertel, Junker, and 
Boulanger, 1981, 1982). These survey analyses complement small-scale 
correlational and experimental studies in providing on representative 
national samples data on fairly comprehensive sets of the productivity 
factors, each of which may be statistically controlled for the others in 
multiple regressions of achievement and subject-matter interest. 
Such analyses allow a simultaneous assessment of qualities and amounts 
of instruction and the other factors in the production of learning. 
Since the factor levels are reported as experienced by individual 
students, the analyses are sensitive to micro-variations in the multiple 
environments of the school, peer group, home, and mass media to which 
each student is exposed. 

Although the sets of variables available in the National Assessment can 
be used to assess possible exogenous causes because they are measured 
and can be statistically controlled in regression equations, the 
measures are cross-sectional for individuals. Therefore, they cannot 
effectively rule out reverse causation such as learning as a cause of 
motivation and more stimulating teaching. Another shortcoming of the 
data is that parental socioeconomic status serves as a proxy for ability 
and prior achievement. 

As pointed out above, nonetheless, the strengths of the National 
Assessment data complement those of small-scale bivariate studies that 
typically control for only one or two of the factors. If syntheses of 
both data sources point in the same direction, then more confidence can 
be placed in the conclusions. 



Table 6 shows that the factors, when controlled for one another, are 
surprisingly consistent in sign, significance, and magnitude across 
subject matters, ages, operational measures of the factors, and 
independent national samples. The median standardized regression 
weights and squared multiple correlations, shown in the last row, reveal 
the small to moderate effects of the factors when controlled for one 
another and sizable amounts of variance accounted for even without 
ability and prior achievement measures. 

Syntheses of Open Education Research 

Open education is an elusive concept, now dismissed by many educators, 
but one that research synthesis now illuminates. The history of efforts 
to synthesize its effects is instructive about the dangers of basing 
conclusions, policies, and practices on single studies; replication and 
improved methods of syntheses; and a shortcoming of much of the research 
discussed above that employs grades and standardized achievement as the 
sole outcomes of teaching. 

From the start, open educators tried to encourage educational outcomes 
that reflect school-board goals such as cooperation, critical thinking, 
self-reliance, constructive learning attitudes, life-long learning, and 
other goals that evaluators seldom measure. Raven's (1981) summary of 
surveys in Western countries including England and the United States 
shows that educators, parents, and students rank these goals far above 
standardized test achievement and grades. 

A synthesis of the relation of conventionally measured educational 
outcomes and adult success, moreover, shows their slight association 
(Samson and others, 1982). Thirty-three post-1949 studies of physicians, 
engineers, civil servants, teachers, students in general, and other 
groups show a mean correlation of .155 of these educational outcomes 
with success indicators such as income, self-rated happiness, work 
performance and output indexes, and self-, peer-, and supervisor-ratings 
of occupational effectiveness. These results should challenge educators 
and researchers to seek a balance between continuing motivation 
and skills to learn and perform well on new tasks as an individual or 
group member on one hand and mastery of teacher-chosen, textbook 
knowledge that may soon be obsolete or forgotten on the other. 

Perhaps since Socrates, however, .arguments over student-centered and 
teacher-centered education have remained so polarized, polemical, and 
pervasive that educators find it difficult to stand firmly on the high 
middle ground of balanced, joint, or cooperative determination of the 
goals, means, and evaluation of learning. Progressive education, the 
Dalton and Winnetka plans, team teaching, the ungraded school, and 
other innovations in this century held forth this ideal but gravitated 
toward authoritarian teaching or permissiveness and could not be 
sustained. Although open education, too, faded from view, it was more 
carefully researched; and syntheses of it may help prepare educators for 
evaluating future efforts. 

Three Syntheses of Open Education. Horwitz (1979) first synthesized 
about 200 comparative studies of open and traditional education by 
tabulating vote counts by outcome category. Although many studies 
yielded non-significant or mixed results especially with respect to 
academic achievement, self-concept, anxiety, adjustment, and locus of 
control, more positive results were found in open education on attitudes 
toward school, creativity, independence, curiosity, and cooperation. 

Peterson (1979) calculated effect sizes for the 45 published studies. 
She found about equal or slightly inferior effects of open education on 
reading and mathematics achievement; .1 to .2 effects on creativity, 
attitudes toward school, and curiosity; and .3 to .5 effects on 
independence and attitudes toward the teacher. 

Hedges, Giaconia, and Gage (1981) synthesized 153 studies including 90 
dissertations using an adjustment of Glass's effect-size estimator, which 
is slightly biased especially in small samples. The average effect was 
near zero for achievement, locus of control, self-concept, and anxiety; 
about .2 for adjustment, attitude toward school and teacher, curiosity, 
and general mental ability; and about .3 for cooperativeness, 
creativity, and independence. 

Despite the differences in study selection and synthesis methods, the 
three studies converge roughly on the same plausible conclusion: 
Students in open classes do slightly or no worse in standardized 
achievement and slightly to substantially better on several outcomes 
that educators, parents, and students hold to be of great value. 
Unfortunately, the negative conclusion of Bennett's (1976) single 
study (prefaced by a prominent psychologist, published by Harvard 
University Press, publicized by The New York Times and media and experts 
that take that newspaper as their source) probably sounded the death 
knell of open education, even though the conclusion of the study was 
later retracted (Aitkin, Bennett, & Hesketh, 1981) because of obvious 
statistical flaws in the original analysis (Aitkin, Anderson, & Hinde, 
1981). 

Components of Open Education. Giaconia and Hedges (1982) took another 
recent and constructive step in the synthesis of open education 
research. From the prior effect-size synthesis, they identified the 
studies with the largest positive and negative effects on several 
outcomes to differentiate more and less effective program features. 
They found that programs that are more effective in producing the 
non-achievement outcomes (attitude, creativity, and self-concept) 
sacrificed academic achievement on standardized measures. 
These programs were characterized by emphasis on the role of the child 
in learning, use of diagnostic rather than norm-referenced evaluation, 
individualized instruction, and manipulative materials but not three 
other components sometimes thought essential to open programs: multi-age 
grouping, open space, and team teaching. Giaconia and Hedges speculate 
that children in the most extreme open programs may do somewhat less 
well on conventional achievement tests because they have little 
experience with them. At any rate, it appears from the two most 
comprehensive syntheses of effects that open classes on average enhance 
several non-standard outcomes without detracting from academic 
achievement unless they are radically extreme. 

Synthesis of Instructional Theories 

To specify the productivity factors in further theoretical and 
operational detail that provide a more explicit framework for future 
primary research and synthesis, Haertel, Walberg, and Weinstein (1983) 
compared eight contemporary psychological models of educational 
performance. Each of the first four factors in Table 7 (student ability 
and motivation, and quality and quantity of instruction) may be 
essential or necessary but insufficient by itself for classroom learning 
(age and developmental level are omitted because they are unspecified in 
the models). 

The other four factors in Table 7 are less clear; although they 
consistently predict outcomes, they may support or substitute for 
classroom learning. At any rate, it would seem useful to include all 
factors in future primary research to rule out exogenous causes and 
increase statistical precision of estimates of the effects of the 
essential and other factors. 

Table 7 shows that, among the constructs, ability and quality of 
instruction are widely and relatively richly specified among the models. 
Explicit theoretical treatments of motivation and quantity of 
instruction, however, are largely confined to the Carroll tradition 
represented in the first four models; and the remaining factors are 
largely neglected. 

The table poses empirically researchable theoretical questions; the 
tension between theoretical parsimony and operational detail, for 
example, suggests several: Can the first four constructs mediate the 
causal influences of the last four? Would assessments of Glaser's five 
student-entry behaviors allow more efficient instructional 
prescriptions than would, say, Carroll's, Bloom's, or Bennett's more 
general and more parsimonious ability subconstructs? Would fewer 
subconstructs than Gagne's eight instructional qualities and 
Harnischfeger's and Wiley's seven time categories suffice? 

The theoretical formulation of educational performance models of the 
past two decades since the Carroll and Bruner papers has made rapid 
strides. The models are explicit enough to be tested in ordinary 
classroom settings by experimental methods and production functions. 
Future empirical research and syntheses that are more comprehensive 
and better connected operationally to these multiple theoretical 
formulations should help reach a greater degree of theoretical and 
empirical consensus as well as more effective educational practice. 

Desegregation and Educational Productivity 

As the previous section has shown, sufficient empirical and theoretical 
syntheses have accumulated during the past five years to point more 
definitively than ever before to the proximal, alterable factors that 
affect educational achievement. Nearly all the research has been 
carried out in natural settings such as homes and schools, and most of 
it shows generalizability across student characteristics, subjects, and 
research methods, including randomized assignment to experimental 
treatments. 

The large average magnitude and consistency of many of these productive 
factors justly provide a substantial amount of confidence about how 
educational achievement may be raised. Since many of the factors and 
techniques have already been extensively employed in ordinary schools 
and found successful, inexpensive, and non-controversial, it appears 
that educational achievement might be increased substantially by 
implementing a selection of the most productive of the factors, say, 
those with effect sizes above .3, more extensively and intensively. The 
purpose of this section is to compare the consistency and magnitude of 
such factors to the effects of school desegregation, as revealed by 
three recent meta-analyses: Krol (1978), Crain and Mahard (1982), and my 
statistical summary of the studies meeting the selection criteria of the 
National Institute of Education (NIE) panel of scholars. 

Selection Criteria 

Aside from the inclusion of data only on Black students in all three 
meta-analyses, Krol (1978, p. 16), Crain and Mahard (1982, p. 6) and the 
NIE panel (Schneider, Note 1) varied considerably in explicit criteria 
for study selection. Krol, for example, excluded studies that lacked 
achievement measures before and after desegregation and those that lacked 
sufficient statistics to calculate effect sizes (pp. 83-84). Excluding 
studies without pretests turns out to be a reasonable decision because 
Wortman's (Note 2) research shows desegregated groups are on average 
advantaged on achievement before desegregation. Thus apparent posttest 
advantages of desegregation are in part attributable to pre-existing 
differences, and pretest adjustment is required for valid estimation of 
desegregation effects. 

Crain and Mahard (1982) "excluded a large number of papers, many of 
which compared students in racially segregated and racially mixed 
schools, but gave no indication that a formal desegregation plan had 
been adopted" (p. 6). Because they included studies that employed 
ability (in contrast to educational achievement) as a dependent variable 
and conducted a more recent and exhaustive search, they used 93 studies 
for analysis in contrast to Krol's 55 (see Tables 8 and 9). 

Table 8 

Effects of Desegregation on Black Achievement 
in Three Syntheses 

                         Positive        Effect Sizes 
                         Results                 Standard 
Source                   Percent      Mean       deviation 

Krol (1978)                61         .16          .41 
  Comments: Based on 71 comparisons in 55 studies; grade level, 
  mathematics and verbal achievement, and program-duration 
  differences tested and found insignificant. 

Crain & Mahard (1982)      62         .10          .2[illegible] 
  Comments: Percent calculated as sum of 175 positive and half of 50 
  non-significant comparisons of 321 comparisons in 93 studies; 
  effect-size mean based on 70 studies. With studies as units, 
  significantly larger effects in kindergarten and grade one were 
  found. 

"Acceptable Studies"       64         .13          .24 
  Comments: Since the pretest advantage of desegregated groups over 
  control groups was .18, results are calculated for 11 
  study-weighted means of posttests adjusted for pretests. 

Table 9 

Inferences from Three Syntheses 
About the Effects of Desegregation on Black Achievement 

                        Percent-Positive Studies   Average Effect Sizes 
                        Significance  Magnitude    Significance  Magnitude 
                           (.05)        (67%)         (.05)        (.20) 

Krol (1978)                  ?           No             ?           No 
Crain & Mahard (1982)        ?           No            Yes          No 
"Acceptable Studies"         No          No             No          No 

Conclusion                   No?         No             ?           No 

Note. The criteria for inferences are as follows: The significance 
required is the standard .05 level calculated for a sign test for a 
50-50 split for positive vote counts, and a t test for the difference of 
the mean effect size from zero, when possible, on independent units of 
analysis, that is, studies not comparisons. The magnitude criteria are 
67 percent of the studies positive and an average effect size of .20, 
for which the desegregated students would exceed 58 percent of the 
control-group students. 
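
The table note's link between an effect size of .20 and the 58th 
percentile of the control group follows from the normal curve. The 
sketch below assumes normally distributed scores with equal variances 
in both groups; the function name is illustrative and not from the 
original. 

```python
from statistics import NormalDist

def percent_of_controls_exceeded(effect_size):
    """Percent of the control group scoring below the average treated
    student, assuming normal distributions with equal variances."""
    return 100.0 * NormalDist().cdf(effect_size)

for es in (0.20, 0.45):
    pct = percent_of_controls_exceeded(es)
    print(f"ES = {es:.2f}: exceeds about {pct:.0f} percent of controls")
```

Under these assumptions an effect size of .20 corresponds to about the 
58th percentile, and the "large" criterion of .45 to about the 67th. 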
The NIE panel employed a number of stringent criteria for study 
rejection including the following: non-empirical and summary reports; 
studies done outside the U.S. and geographically non-specific; those 
that combined or compared ethnic groups, lacked contemporaneous-control 
or pre-desegregation data, or analyzed heterogeneously desegregated 
groups; those with more than 35 percent attrition, majority-Black 
desegregated conditions, varied exposure to desegregation, and 
non-comparable groups; those with unknown sampling procedures, 
cross-sectional data, or non-comparable samples at each observation 
point; those with unreliable or unstandardized instruments, unknown test 
content or instruments, unknown test administration dates, ability tests 
as dependent variables, and non-equivalent pretests and posttests; and 
insufficient statistics (Schneider, Note 1). Application of these 
exclusion criteria (Wortman, Note 2) resulted in 19 "acceptable 
studies." 

Thus, all three data sets are similar in including only studies of Black 
achievement. They differ chiefly in that Krol and the NIE panel, unlike 
Crain and Mahard (1982), exclude ability tests, and the NIE panel 
employed stringent methodological criteria that resulted in a selection 
of studies only 19 percent as large as Crain and Mahard's set (see 
Table 8). 

The NIE panel may be right in specifying stringent selection criteria 
from one viewpoint: the conclusions of review articles are usually based 
upon methodologically acceptable studies. But, as Glass, McGaw, and 
Smith (1982, p. 226) point out, excluding studies by implicit or explicit 
selection criteria can convert empirical questions of research 
methodology to a priori assumptions. Excluding studies without 
pretests, for example, may exclude randomized experiments, possibly the 
best design in certain respects for probing causality and avoiding 
untenable covariance assumptions. 

If it were to be found that randomized posttest-only designs yielded the 
same results as pretest-posttest quasi-experiments, then greater 
confidence could be placed in the results than the results of either 
design by themselves, since the two designs are subject to different 
threats to methodological validity (Cook & Campbell, 1979). Because, 
for example, the findings on instructional research are generally robust 
and consistent across study features, such as research methods and 
student characteristics, substantial confidence can be placed in their 
results. 

Moreover, excluding studies on policy or substantive criteria may be 
useful to lighten the effort or to narrow research questions, but 
exclusion also restricts the inferences and comparisons that can be made 
and the policies that may be implied. In the Krol and NIE selections, 
for example, it will not be possible to determine whether desegregation 
has a different impact on achievement than it does on ability or other 
educational outcomes such as creativity, critical thinking, interest in 
further learning, and social perceptiveness. In none of the three sets 
of studies, moreover, will it be possible to compare the effects of 
desegregation on Asian, Black, Hispanic, and White students. At least 
for some parents, educators, policy makers, researchers, and others, it 
would be useful to have reliable information on these and other points. 

None of this is to argue that all studies should be summarized in one 
overall vote count or mean effect size. Although that statistic and its 
significance are of interest, characteristics of the studies such as 
Cook and Campbell's (1979) 33 threats to methodological validity, 
student characteristics such as ethnicity and grade level, and 
conditions of desegregation such as voluntary and mandatory plans, 
should be categorized, coded, and tested for statistical significance 
with studies as the units to afford independence as assumed in 
statistical inference. (If desegregation is working generally well 
according to a study, then students in different grades within the study 
are likely to do well, and their performance is correlated and not 
statistically independent; similarly, if students are doing poorly in 
another study, different grades lack independence; therefore the means 
for studies, not for grade levels or other units, must be taken as the 
units for meta-analysis, or each comparison in a study must be weighted 
inversely to the number of comparisons in the study. Another reason for 
using study means or weighting is to insure that each study is given an 
equal weighting of one, not a weighting based on the arbitrary number of 
comparisons the investigator happened to make.) 
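
The difference between weighting by comparison and weighting by study 
can be illustrated with a small sketch. The data below are hypothetical 
and the function name is illustrative; neither comes from the three 
meta-analyses. 

```python
from collections import defaultdict

def study_weighted_mean(comparisons):
    """Mean effect size with each study given a total weight of one.

    `comparisons` is a list of (study_id, effect_size) pairs. Weighting
    each comparison by 1 / (number of comparisons in its study) is
    equivalent to averaging the per-study means, which is done here.
    """
    by_study = defaultdict(list)
    for study_id, effect_size in comparisons:
        by_study[study_id].append(effect_size)
    if not by_study:
        raise ValueError("no comparisons supplied")
    study_means = [sum(v) / len(v) for v in by_study.values()]
    return sum(study_means) / len(study_means)

# Hypothetical data: study "A" reports four grade-level comparisons,
# while studies "B" and "C" report one comparison each.
data = [("A", 0.40), ("A", 0.50), ("A", 0.30), ("A", 0.40),
        ("B", 0.00), ("C", -0.10)]

comparison_weighted = sum(es for _, es in data) / len(data)
study_weighted = study_weighted_mean(data)
print(f"comparison-weighted mean = {comparison_weighted:.2f}")
print(f"study-weighted mean      = {study_weighted:.2f}")
```

In this made-up example the one study with many comparisons pulls the 
comparison-weighted mean well above the study-weighted mean, which is 
the distortion the parenthetical argument warns against. 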

Synthesis of Three Meta-analyses 

Tables 8 and 9 show what can be validly extracted as the chief findings 
from the three meta-analyses. Table 8 shows that three estimates of 
percent-positive studies vary between 61 and 64 percent. These 
percentages are in surprisingly close agreement considering the widely 
different selection criteria and numbers of studies in the three 
syntheses. 

Table 9 shows that the statistical significance cannot be determined in
two cases because the percentages of positive comparisons rather than
of studies are reported; and, in the NIE case, the sign test based on the
number of studies is not significant. By the norms of recent syntheses of
productivity factors discussed in previous sections, the percentage
magnitudes are neither large (85 percent) nor average (67 percent). The
statistical significance of the percentages cannot be determined in the
two previously reported syntheses and is not significant in the
case of the NIE selection.
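
A sign test of the kind referred to above can be sketched in a few lines.
This is an illustration under stated assumptions: the counts used (12
positive studies out of 19) are hypothetical and chosen only to fall in
the 61 to 64 percent range mentioned; they are not taken from Table 9.

```python
from math import comb


def sign_test_p(n_positive, n_studies):
    """Two-sided exact sign test against a 50/50 null, studies as units."""
    k = max(n_positive, n_studies - n_positive)
    upper_tail = sum(comb(n_studies, i) for i in range(k, n_studies + 1))
    return min(1.0, 2.0 * upper_tail / 2 ** n_studies)


# Hypothetical counts: 12 of 19 studies positive (about 63 percent).
p_value = sign_test_p(12, 19)
print(round(p_value, 3))
```

With so few studies, a percentage in the low sixties does not approach
the conventional .05 level, which is consistent with the conclusion drawn
in the text.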

The statistical significance of the effect sizes is mixed:
indeterminate for Krol, because of comparison weighting; significant for
Crain and Mahard; and not significant for the set of studies acceptable
to the NIE panel. In none of the three cases was the magnitude of the
effect large (.45) or average (.20). (Crain and Mahard's significant
finding of higher effects in kindergarten and first grade is
unsupported by Krol and reversed in analyses by Wortman (Note 2); and
their randomized-longitudinal effect is not significant with study as the
unit. Thus, their overall average study-weighted effect size is
reported in Table 8.)

The results from the three meta-analyses suggest that the vote counts
fail, with some uncertainty, to reach conventional levels of statistical
significance. By normative standards of recent syntheses of other
educational factors, they clearly fail with respect to percentage
results. The effect sizes as a set are indeterminate with respect to
significance and certainly fail to reach criterion levels with respect
to normative magnitude.

Conclusion 

New techniques of research syntheses show a number of potent factors for 
improving educational achievement that have proven to be consistently 
effective in a wide variety of experimental and educational conditions. 
These include the amount and quality of instruction, constructive
classroom morale, and stimulation in the home environment. It is in our
national economic, social, and political interest to implement these
factors more deeply and widely for all children (Walberg, 1983). In this

School Desegregation and
Black Achievement: An Integrative View

Paul M. Wortman
University of Michigan



PROBLEM 

Race relations between Blacks and Whites have played a significant role
in the history of the United States. Social science theory and data, in
particular, have figured prominently in the controversies that have
constantly surrounded major events in this history. For example, the
two landmark U.S. Supreme Court decisions dealing with desegregation,
Plessy v. Ferguson in 1896 and Brown v. Board of Education in 1954
(Kluger, 1975), were both based in part on current social science
evidence. More recently, the so-called Coleman Report, or the Equality
of Educational Opportunity Survey (Coleman, Campbell, Hobson,
McPartland, Mood, Weinfeld and York, 1966), was used by the Johnson
administration to accelerate the desegregation process (Grant, 1973).
The Coleman Report claimed that Black student achievement increased in
more integrated environments (i.e., with a greater proportion of White
students). This study and finding led not only to a number of
reanalyses by social scientists, but also to an increasing number of
systematic studies using before and after measures (i.e., pretests and
posttests) of achievement and control or comparison groups of segregated
Blacks. These studies aimed at eliminating the methodological
weaknesses of cross-sectional surveys such as the Coleman Report and
testing some of its hypotheses and those of other social scientists.

By the mid-1970s there had accumulated a sufficient body of scientific
studies that a number of careful reviews appeared. Two of the most
notable of these reviews were conducted by Bradley and Bradley (1977)
and St. John (1975). The Bradleys examined 29 studies of the effects of
desegregation on Black achievement, while St. John reviewed 64 (including
12 cross-sectional studies). Both found the evidence inconclusive. The
Bradleys concluded that the evidence on the effectiveness of
desegregation on Black achievement was "inconsistent and inadequate,"
while St. John similarly acknowledged, "More than a decade of
considerable research effort has produced no definitive positive
findings." St. John went on to quote Light and Smith (1971) that
"progress will only come when we are able to pool, in a systematic
manner, the original data from the studies." Such methods for
synthesizing the results of scientific studies have recently gained
widespread popularity, largely due to Glass's seminal work on
"meta-analysis" (1976, 1977).

Meta-analysis offers a number of advantages over previous methods for
aggregating the findings of different studies (Light and Smith, 1971;
Glass, 1977). In Table 1 we have listed some of the positive and
negative characteristics of this technique. The major positive
qualities are a single, precise, quantitative measure of the average
magnitude of program impact. It is applicable to most social science
research and provides an important result that is easy to grasp.
Meta-analysis also allows one to consider sample size and design
quality. This technique also has its "disadvantages," especially
when extended to studies with methodological problems such as
quasi-experiments (i.e., studies lacking random assignment).

Table 1

Advantages and Disadvantages of Meta-analysis

Standard meta-analytic methods have already been applied to this
literature (Crain and Mahard, 1982; Krol, 1978). The meta-analyses
performed by Krol and by Crain and Mahard both found small positive
benefits for desegregation on Black achievement (.16 and .08 standard
deviations, respectively). Both are flawed in our opinion. Krol's
study illustrates the inappropriate application of Glass's method. For
example, Glass (1977, p. 356) does recommend using pre-experimental
designs lacking controls "if the treated group members' pretreatment
status is a good estimate of their hypothetical posttreatment in the
absence of treatment." As we will demonstrate in the next section, this
suggestion may be unwarranted and ill-advised. Crain and Mahard (1982),
in a very recent meta-analysis, have taken a traditional Glassian
approach and included all studies in their analysis. As we shall
indicate below, we feel this approach is inappropriate. Many studies
have so many methodological weaknesses that they should not be included.
Moreover, some studies, such as those using a cross-sectional survey,
cannot yield the necessary statistical information (since they lack both
a pre-desegregation or pretest measure and a control group), but
were included by Crain and Mahard. Other studies used White control
groups or national test norms to generate effect sizes; both are
inappropriate comparisons, as will be discussed below. Such studies
account for half of those included in Crain and Mahard's meta-analysis.
Most importantly, however, both Krol and Crain and Mahard paid
insufficient attention to the threats to validity that could confound
and bias the results of their meta-analyses.

The school desegregation-achievement literature poses some special
problems for the meta-analysis method. It is almost entirely
quasi-experimental in composition and thus susceptible to other
interpretations (i.e., so-called "plausible rival hypotheses").
Meta-analysis of such studies assumes either that appropriate
statistical adjustments can be made for the various "threats to
validity" or that the "strategic combination argument" (Staines, 1974)
holds (see "disadvantages" in Table 1). This latter term stands for the
belief that flawed studies can be combined because the "weaknesses
cancel each other out." It is just this argument that Glass (1977) used
in recommending meta-analysis of "weak" studies. While Glass was
initially confident that his method could be used with
quasi-experiments, his views have gradually changed (cf. Glass and
Smith, 1979). The examination of the desegregation quasi-experimental
studies presented in the following sections indicates that selection is
a persistent "plausible rival hypothesis." That is, it is not cancelled
out. Therefore, a number of steps have been taken to deal with this.
First, an adjustment was developed for reducing the bias due to
selection. Second, studies that were judged a priori not to have
selection problems were compared with those requiring adjustment.

The focus of this paper is on the effect of school desegregation on
Black achievement. While interest in these data is primarily
methodological and stems from earlier work by the author on the
secondary analysis of the Riverside School Study (RSS) of desegregation
(Linsenmeier and Wortman, 1978; Moskowitz and Wortman, 1981), a number
of substantive issues are addressed. In addition to estimating the
overall effectiveness of desegregation, such issues as the impact of
type of achievement (math or verbal) and time of desegregation (early or
later grades) are also discussed. This latter, substantive focus
qualifies this study as an "integrative review" (Jackson, 1980). In the
next section, the meta-analytic method used in this study is described.
As the "disadvantages" column in Table 1 indicates, not all studies are
suitable for meta-analysis. Those with numerous or severe
methodological flaws, inadequate reporting of statistical information,
or insufficient control data were not included. In the third section,
the procedure for including studies in the analysis is described. The
results and conclusions are presented in the last two sections.

METHODOLOGY 



To apply meta-analysis to quasi-experimental data one needs to obtain a
measure of "effect size" (ES). The basic equation adopted from Cohen
(1969) is:

(1)

situation it is likely that the groups will differ initially. That is,
selection is a major threat to validity that is represented in this
model.

Meta-analysis involves summing of the effect size estimates from
all studies. We define it as:

X is the sample mean of the experimental or control group at
time 1 and 2 for the i-th study and s is the control group
standard deviation.
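
The equations themselves did not survive in this copy, so the following
is only a sketch under explicit assumptions: that the basic measure is
the conventional standardized mean difference (experimental mean minus
control mean, divided by the control group standard deviation), and that
the pretest-adjusted form subtracts the standardized pretest difference
from the standardized posttest difference. The function and argument
names are hypothetical, as are the numbers in the example.

```python
def effect_size(mean_exp, mean_ctl, sd_ctl):
    """Standardized mean difference, scaled by the control-group SD."""
    if sd_ctl <= 0:
        raise ValueError("control-group standard deviation must be positive")
    return (mean_exp - mean_ctl) / sd_ctl


def pretest_adjusted_es(exp_pre, ctl_pre, sd_ctl_pre,
                        exp_post, ctl_post, sd_ctl_post):
    """Assumed adjustment: posttest ES minus pretest ES.

    Subtracting the pretest term removes the part of the posttest
    difference that was already present before the treatment began.
    """
    post = effect_size(exp_post, ctl_post, sd_ctl_post)
    pre = effect_size(exp_pre, ctl_pre, sd_ctl_pre)
    return post - pre


# Hypothetical study: the desegregated group starts 0.2 SD behind the
# segregated controls and ends 0.1 SD ahead.
adjusted = pretest_adjusted_es(48.0, 50.0, 10.0, 53.0, 52.0, 10.0)
print(round(adjusted, 2))
```

In this made-up case the unadjusted posttest difference would understate
the gain, because it ignores the initial disadvantage of the
experimental group.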

The average effect size is usually presented. This average can be
computed in a number of ways. For example, all ESs can be summed and
averaged. Since many ESs may be derived from a single study, this
introduces bias due to nonindependent measures. It was largely for this
reason that Landman and Dawes (1982) reanalyzed Smith and Glass's (1977)
meta-analysis of the effectiveness of psychotherapy.

The desegregation literature is largely composed of quasi-experiments or
even more poorly designed studies. As such, it is susceptible to a
variety of threats to internal validity (i.e., the ability to infer
causality). It is risky to assume that these potential sources of bias
can be treated as random errors that are self-cancelling. Two threats,
in particular, have been much discussed in reviews of this literature.
They are "selection" and "differential growth" or "maturation." These
are considered in the next paragraphs; other threats to validity are
discussed in the next section.

Selection

Campbell and his associates (Campbell and Erlebacher, 1967; Campbell and
Boruch, 1975; Campbell and Stanley, 1966; Cook and Campbell, 1979) have
been concerned with the recurrent problem in estimating program effects
when various selection procedures are used. In particular, they have
discussed selection of those students with extreme (pretest) scores
and/or matching experimental and control subjects by (pretest) score. Both
of these selection procedures are subject to substantial "regression
artifacts" resulting from the unreliability of the measures used. While
there is no agreed-upon procedure for adjusting for these selection
effects, a number of methods have been developed (cf. Wortman,
Reichardt, and St. Pierre, 1978). These methods require both
student-level data and test reliabilities in order to be applied. That
information is generally not reported in the studies of desegregation
and would require reanalysis of individual studies if available.
Instead, the pretest adjustment procedure described in Equations 2 and 3
will be employed. Since matching was rarely used, this method should
adjust for the selection or "subject equivalence" problem that Bradley
and Bradley (1977) and St. John (1975) found to be the major
methodological weakness in the better or "well designed" studies.
Neither Crain and Mahard (1982) nor Krol (1978) attempted to correct or
adjust for bias introduced by initial subject nonequivalence.

Differential Growth 

It is well known that Blacks and Whites show different rates of
intellectual growth. Thus differential growth or "maturation" may be
considered an important source of bias in synthesizing the data from the
desegregation literature. This problem is dealt with in three ways:
conceptually, empirically, and analytically. First, only studies using
Black controls were examined. This is the comparison recommended by St.
John (1975) and should reduce or eliminate the problem. Such controls
avoid problems (or confounds) caused by race and socioeconomic status.
They also allow examination of the major policy question being
addressed: the effect of continued racial isolation or segregation.
Fortunately, most studies used such a control group (i.e., segregated
Blacks). As noted above, both Crain and Mahard (1982) and Krol (1978)
included studies that used White controls.

Second, the results of the pretest adjustment are compared to those
studies not requiring such corrections (i.e., no pretest differences) to
determine if other differences or sources of bias remain. As will be
noted, "differential regression to the mean" (Cook and Campbell, 1979)
may account for the residual difference. And third, the analytic method
is examined to determine its robustness to this source of bias. It may
be recognized that Equation 2 is identical to the model for differential
growth rates labelled by Campbell the "fan spread hypothesis" (Campbell
and Erlebacher, 1970; Cook and Campbell, 1979). In fact, if
differential growth is the only cause of change from time 1 to time 2,
then according to the fan spread model:




This hypothesis implies that an increase in the mean is accompanied by a
proportional increase in the within-group variance. Thus, ES = 0 when
this "threat to validity" (i.e., differential growth) is present. This
means that selection-maturation interaction will not bias the estimate
of effect size for quasi-experiments of this type (i.e., the
nonequivalent control group design or NECGD) that are pretest-adjusted.
This is exactly the model proposed by Campbell (1971) and described by
Kenny (1975). As Campbell and Boruch (1975) note, standardizing scores
will eliminate this problem. The effect size measure as defined above
in Equation 1 is a standardized score.
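
The fan spread argument can be checked numerically. The sketch below
assumes, since the equation is not legible in this copy, that the
pretest-adjusted effect size is the standardized posttest difference
minus the standardized pretest difference, and that under fan spread the
group means and the standard deviation all grow by the same factor. All
numbers and names are hypothetical.

```python
def standardized_diff(mean_exp, mean_ctl, sd):
    """Difference between group means in standard deviation units."""
    return (mean_exp - mean_ctl) / sd


# Hypothetical time 1 values: the experimental group starts one SD behind.
exp_1, ctl_1, sd_1 = 40.0, 50.0, 10.0

# Pure fan spread with no treatment effect: means and SD grow by the
# same factor between time 1 and time 2.
growth = 1.5
exp_2, ctl_2, sd_2 = exp_1 * growth, ctl_1 * growth, sd_1 * growth

# The raw difference in gain scores suggests a (spurious) negative effect.
raw_gain_difference = (exp_2 - exp_1) - (ctl_2 - ctl_1)

# The pretest-adjusted standardized effect is zero, as the text argues.
adjusted_es = (standardized_diff(exp_2, ctl_2, sd_2)
               - standardized_diff(exp_1, ctl_1, sd_1))

print(raw_gain_difference)
print(adjusted_es)
```

Under these assumptions the widening raw gap is entirely absorbed by the
growth in the standard deviation, so the standardized, pretest-adjusted
estimate shows no effect.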
Practical Limitations



There are a number of problems in translating this small analytic model
into an actual meta-analysis. First, the NECGD requires the means and
standard deviations for the experimental and control groups on both the
pretest and posttest. Often these essential data are not furnished,
especially in those cases where statistically non-significant results
were obtained. The reliability of the tests used is even less likely to
be reported. In order to deal with this situation, a variety of
indirect approaches have been proposed (cf. Glass, 1977).



Using Significance Results. Reports often provide only information on
sample size, significance level, and the value of the test statistic.
In these cases the effect size can be obtained using indirect methods.
In the case of the t-test, it is:

$$ES = t \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

When n_1 = n_2, and each is thus about half of the degrees of freedom
(df), then according to Rosenthal (1978):

$$ES = \frac{2t}{\sqrt{df}}$$

This indirect estimate will be conservative when the exact significance
level is not reported and the t value is not given. Typically, the .05
or .01 significance levels are used in social science research. If the
results are not significant, little if any information is usually
provided. In this case, a .50 significance level will be used, as Cooper
(1979) has suggested. This is the expected mean value of the
distribution of non-significant studies. Similar indirect computations
can be derived from other test statistics such as F (see Appendix 7 in
Smith, Glass, and Miller, 1980).
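
These indirect conversions can be sketched as follows. The two functions
for t use the standard conversions for two independent groups. The
function that starts from a significance level is an added assumption of
this sketch: it treats the reported level as two-tailed and substitutes
a normal deviate for t, which is reasonable only for fairly large
samples. All names and example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def es_from_t(t, n1, n2):
    """Effect size from a two-group t statistic and the group sizes."""
    return t * sqrt(1.0 / n1 + 1.0 / n2)


def es_from_t_df(t, df):
    """Equal-n shortcut: ES is approximately 2t / sqrt(df)."""
    return 2.0 * t / sqrt(df)


def es_from_p(p_two_tailed, n1, n2):
    """Rough ES when only a significance level is reported.

    Assumption: the level is two-tailed and the normal deviate is an
    adequate stand-in for t. The sign must be set from the reported
    direction of the result.
    """
    z = NormalDist().inv_cdf(1.0 - p_two_tailed / 2.0)
    return es_from_t(z, n1, n2)


# Hypothetical study with 50 students per group.
print(round(es_from_t(2.0, 50, 50), 3))     # exact t reported
print(round(es_from_t_df(2.0, 100), 3))     # shortcut with df close to 2n
print(round(es_from_p(0.05, 50, 50), 3))    # only "p < .05" reported
print(round(es_from_p(0.50, 50, 50), 3))    # "not significant", .50 rule
```

Using the threshold itself (.05 or .01) in place of the unreported exact
level gives the smallest t consistent with the report, which is why the
text describes the resulting estimate as conservative.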




Gain Scores. Another common form of reporting results is the gain
score. This is the change in each group from pretest to posttest. In
Figure 1 this would be:

for experimental and control groups, respectively. A simple algebraic
manipulation reveals that the difference in the two gain scores is
equivalent to the numerator in the basic equation to estimate the effect
size for quasi-experiments (Eq. 2). Thus if the two standard deviations
are approximately equal, gain scores can be used to derive d for the
NECGD quasi-experiment.

Other Quasi-experimental Designs. Other quasi-experimental designs are
often encountered and it is important to consider them as well. The
most frequently reported is the case study or, in Campbell and
Stanley's terminology, the One-Group Pretest-Posttest (OGPP) Design.
This is the NECGD without the control group. Krol (1979) suggests that
an effect size estimate can be obtained by using the pretest mean and
standard deviation as the control group. This is a risky assumption in
our opinion, and one that is likely to lead to an overestimate of ES.
As can be readily seen in Figure 1, the use of the standardized gain
score contains a pseudo-effect equal to C2 - C1. Moreover,
if strict selection criteria are used, as they often are in compensatory
education or competency testing remediation programs, then regression
effects will also be incorrectly included. Thus we feel such case study
data should only be used when the proper adjustments can be made. In
order to examine design effects in meta-analysis, a number of these case
studies were included in some of the analyses.

Control group data are frequently difficult to obtain for political and
practical reasons. Programs may be designed to serve all in need, for
example. As a consequence, researchers often attempt to solve the
control group problem by using historical controls, or "cohort
comparisons" according to Crain and Mahard (1982). In fact, this
procedure has been recommended in some areas (cf. Gehan and Freireich,
1974). In education, historical control groups are often created using
student data from the same grades during prior years (i.e., before the
program innovation). This adds "history" to the list of possible
threats to validity, since these data are not obtained concurrently with
the experimental (i.e., desegregation) data. Again, extreme care is
needed in interpreting these data.

Sometimes it is possible to create a cohort of students who are followed
prior to the start of the program. This allows a "dry run" NECGD
experiment (where there is no treatment) to be created and an estimate
of the adequacy of the various adjustment procedures to be obtained
(Wortman, Reichardt, and St. Pierre, 1978). Such data are rarely
available, though. If repeated classes show similar effects, however,
then the data are probably reliable. This variant of the "Recurrent
Institutional Cycle Design" is sometimes used (cf. Teele,
1973). In general, historical controls have been found to grossly
overestimate effects and thus should be avoided where possible (Sacks et
al., 1982). In education, for example, test scores were declining
during the 1960s and 1970s, so that historical controls would probably
have higher scores. Such studies were not included in our analyses, but
comprised 17 percent of the studies in Crain and Mahard's (1982)
meta-analysis. More recently, Crain (1983) has included eight such
studies among his "20 best."

True Experiments. Although our focus has been on quasi-experiments,
"true" or randomized studies would be useful. Just as we were concerned
about the biased estimates produced by pre-experimental design (i.e.,
case) studies when compared to the NECGD quasi-experiments, it is
important to determine the bias resulting from the latter designs. This
information can be obtained if effect size estimates are available from
randomized studies. Not all data sets have this mixture of designs,
especially in education, where there has been a strong tendency for
applied, field problems to be approached quasi-experimentally while
laboratory, theoretical issues have been investigated using randomized
studies. There have been a few randomized studies or true experiments
in the school desegregation area. Those that have been conducted, such
as Project Concern (Iwanicki and Gable, 1978), often report their results
in such a way as to make it impossible to derive effect size estimates.

Crain (1983) identified five randomized studies among his top 20, three
of which were based on data from Project Concern. Three of these
studies (Rock et al., 1968; Samuels, 1971; Zdep, 1971; see Appendix A)
were included among the 31 found acceptable in the present analysis. A
more recent report from Project Concern (Iwanicki and Gable, 1978) was
included in place of the two earlier reports used by Crain.²

Design Quality 

Although the focus is on the NECGD, the quality of the studies using
this design varies. Moreover, as noted above, there are often other
designs employed. A number of approaches to assessing quality have been
developed. The most well-known is the validity approach developed by
Campbell and Stanley (1966) and recently further refined by Cook and
Campbell (1979). Essentially, the threats to validity indicate quality.
Others (Boruch and Gomez, 1977; Sechrest and Yeaton, 1981) have stressed
the "implementation" or "integrity" of the treatment. This is an
important concept, although one that is difficult to measure. The
assessment of research quality is a new area and one that is critical in
the synthesis of scientific studies. There has been much discussion of
this issue (Mansfield and Busse, 1977; Eysenck, 1978; Glass, 1977, 1978),
and the debate still continues (cf. Wortman, 1983). As the following
section indicates, design quality is viewed as significant in selecting,
coding, and analyzing the data in a research synthesis.

PROCEDURE 

The meta-analysis approach first requires the retrieval of relevant
scientific information. The importance of a thoroughly documented
procedure at this point has been stressed by both Cooper (1982) and
Jackson (1980). To that end, we obtained the cooperation of the authors
of the two major studies systematically synthesizing the literature on
the effects of school desegregation on Black achievement (Crain and
Mahard, 1978; Krol, 1978). Both Robert Crain and Ronald Krol generously
provided copies of the articles and the coding schemes used in their
analyses. We then extended and updated this data base through literature
searches including ERIC, dissertation abstracts, references in the
articles and books (especially St. John, 1975), and dozens of letters
to authors and school district offices. We developed a coding scheme
and list of studies to be included in our analyses. These are described
below. As we progressed with our initial coding effort, we realized
that there were many studies that would have to be rejected. We felt it
imperative to describe these studies and our reasons for rejecting them
from the analysis. We did this for two reasons: (a) this is perhaps
the most important, but judgmental, step in data synthesis, and (b) it
is important to determine whether there are unique characteristics of
excluded studies. All studies were read and coded by two independent
reviewers. All discrepancies were resolved so that perfect agreement
was reached. A more detailed description of this procedure and the
studies excluded can be found in an earlier technical report (Wortman,
King and Bryant, 1982). In the next three sections we discuss both of
these concerns.

Exclusion Criteria. The decision to exclude a particular study from the
analyses was based on assessments of the various threats to the study's
validity. The number and magnitude of the flaws in the study were the
deciding factor for inclusion or exclusion. The observed threats to
validity fall into one or more of four basic classifications that have
been developed by Campbell and his associates (Campbell and Stanley,
1963; Cook and Campbell, 1979). Thus, the criteria used to reject
studies (see Table 2) represent specific instances or threats to
internal, external, construct, or statistical conclusion validity.

Internal validity is broadly concerned with whether the treatment (i.e.,
school desegregation) in fact affected the outcome (i.e., academic
achievement of Black students). Threats to internal validity may be
posed by uncontrolled variables representing effects of history,
maturation, and the like, as originally described by Campbell and Stanley
(1963). Most of the factors listed in the table as threats to validity
do not require further explication. However, the rationale behind a few
may not be so apparent. For instance, studies utilizing cross-sectional
survey designs (criterion ^a) were rejected from the analyses because
they typically do not control for extraneous variables in local school
settings that may affect achievement above and beyond the effects of
desegregation. That is, they are usually observations at one point in
time lacking both pretests and adequate controls.



Studies were also rejected that failed to describe their sampling
procedures (criterion 4b) and thus make it impossible to rule out
potentially confounding biases in the selection of comparison groups.
Finally, the use of different tests for segregated and desegregated
students at either pretest or posttest may pose "instrumentation"
problems stemming from differential test reliability and low inter-test
reliability. These problems may either produce spurious treatment
effects or mask real effects. Each of these specific threats may
confound the observed association between desegregation and achievement.

External validity refers to limitations in the generalizability of the
study with regard to populations, settings, as well as treatment and
measurement variables. One obvious reason for exclusion was studies
conducted outside of the United States. Another common threat to
external validity involved the confounding effect of compensatory
equalization of treatment (e.g., extra teachers for segregated controls)
or other kinds of multiple treatment interference (criterion 3g). These
may disguise or distort findings indicating how desegregation affects
achievement. Moreover, when the dates of test administration are not
described (criterion 5c), problems arise in adjusting the effect-size
estimates to a proper time interval as well as determining whether the
pretest actually occurred prior to desegregation.

Construct validity refers to the appropriateness of the theoretical
constructs, variables, and measures used. If the study did not really
deal with desegregation and/or achievement, it was not included. Other
studies were rejected on these grounds, but for less obvious reasons.
These include those that at first appear to measure academic achievement
of desegregated Blacks, but which, in fact, measure a different
construct such as I.Q. (an ability measure); those that measure a
different treatment, such as bus transportation; or a different
population such as Whites or Chicanos (see criterion 3a).

Statistical conclusion validity is concerned with the appropriateness of
the statistical analyses. This includes not only the analyses employed
but also the sufficiency of the data reported for calculating effect
sizes. For example, a study may improperly use ANOVA in the analysis of
a non-equivalent control group design (i.e., criterion 6h) that violates
the assumption of homogeneity of variance (i.e., homoscedasticity). Other
studies may correctly employ statistical procedures where there is
inadequate statistical power from sample sizes too small to reject the
null hypothesis. Finally, studies which grossly combine achievement
results of different grade levels must be rejected because the rate of
achievement gain tends to increase more slowly with advancing grade
level and thus grade-equivalent scores are really not comparable (as
they are normed within each grade separately). Combining scores from
various tests across grade levels further threatens internal
validity insofar as instrumentation effects arise from variations in
test reliability and other test characteristics (e.g., item difficulty
and content).




Applying the criteria listed in Table 2 resulted in the exclusion of 74
studies. Most suffered from more than one problem. A number of these
criteria are sufficient in themselves (i.e., "fatal flaws") to eliminate
a study. All but three studies had such flaws. Overall, we have had to
exclude the majority of studies examined, including a number used in the
previous meta-analyses performed (Crain and Mahard, 1978; Krol, 1978).
A comparison of studies included and excluded is provided in Table 3.
With the exception of Crain and Mahard (1978), we included only about
half of the studies used in other major reviews. The 31 studies
included in our analyses are listed in Appendix A. The studies were
decomposed into effect size data for each grade and for reading and
mathematics achievement, and thus yielded 106 separate "cases." The
overall analyses, however, used the study as the unit of analysis by
averaging the results within each study and combining these average
effect sizes.

Table 3

A considerable amount of effort was spent in documenting this aspect of
the research synthesis. It represents an important, but often
overlooked, part of formal data synthesis procedures, and one that can
produce differing results. While meta-analysis itself is a formal,
quantitative method, the selection of the sample to include in the
analysis is not. Without appropriate, documented selection criteria,
the results can be as subjective and biased as the literature reviews
they seek to replace (cf. Jackson, 1980).

One "disadvantage" of meta-analysis (see Table 1) is its susceptibility
to publication bias. It is assumed that the research literature
contains only studies showing positive, statistically significant
results (i.e., publishable studies). The 31 studies found "acceptable"
contained only two published articles. Desegregation research is
largely (and perhaps appropriately) a fugitive literature. We feel that
the retrieval strategy described above has captured the "target
population" of studies (Cooper, 1982).

The NIE Core Studies 

After this screening process had been performed and the 31 resulting 
studies analyzed, the NIE Desegregation Studies Team convened an expert 
panel to select the best studies in this area. The panel of six 
scholars, including this author, was supposedly balanced in their 
attitudes and published work on desegregation: two pro, two con, and 
two neutral.³ The panel met in July, 1982 and initiated discussion of 
the most appropriate studies to be included in reviewing the literature. 
The criteria listed in Table 2 were examined by the panel and after some 
discussion a subset of them was used to select the highest quality 
studies available. In general these were NECGD studies comparing verbal 
and/or math achievement of desegregated and segregated Blacks. The 
criteria actually used are starred in the table. 

These criteria were entered into the computerized data base and 18 
studies were found that satisfied these requirements. These studies are 
starred in Appendix A. One new study by Walberg (1971) was added at the 
request of some of the panel members. This study had been "rejected" in 
the original analyses since it suffered from an extremely high rate of 
attrition (criterion 3h) that differed for segregated and desegregated 
students (i.e., 27 and 48 percent, respectively). The number of 
students in the desegregated control group was quite small, ranging from 
14 to 53. Moreover, grade levels were combined (criterion 4d). The 
Walberg study added eight "cases" to the data base. Moreover, one of 
the panelists wrote to one of the authors of another study (Sheehan, 
1979) to obtain missing means and standard deviations. This allowed the 
inclusion of two additional cases. 

These studies differ substantially from those used in most previous 
reviews. With the exception of Crain and Mahard (1978), where all but 
one study was included, fewer than half were included in prior reviews. 
For example, Bradley and Bradley (1977) included only five of these 
studies while St. John (1975) reviewed only nine of them. 

RESULTS 

The Glass effect sizes (ESs) for the 31 studies considered 
methodologically acceptable for performing a meta-analysis are presented 
in Table 4. The fourth row labelled "Grand" presents the overall 
effects averaged by study (i.e., the average of the average effect sizes 
for each study) and the ESs by three major research designs. In 
addition, these four categories are broken down by grade in the bottom 
twelve rows. The ESs for reading and mathematics are combined in this 
initial analysis to provide a single measure of overall effectiveness. 
Since some reviewers have noted greater gains for mathematics than 
verbal achievement (St. John, 1975; Krol, 1978), ESs for these two areas 
of achievement were also examined and are reported below. 

The overall ES for the 31 studies is .45 standard deviations. The ES is 
relatively unaffected by various weighting schemes. This figure is 
considerably larger than those reported by Crain and Mahard (1982) and 
Krol (1978). However, the ESs for the more well-designed 
quasi-experiments are considerably smaller (i.e., .32 and •IS). It is clear 
that the studies using the weaker OCPP design are inflating the estimate 
of the ES (i.e., 1.22). As was noted earlier, this latter design 
confounds maturation and initial differences in student selection with 
the effect of desegregation. Such design effects resulting from 
differences in study quality are commonly reported (cf. Wortman, 1983). 
In practically all such cases the weaker designs produce larger 
estimates of effects. Thus design quality must be considered in 
conducting an integrative review. As Jackson (1980) notes, "The results 
of the analysis may be misleading if there is not at least a modest 
number of studies with good overall design." 

The bottom twelve rows of the table present the results by grade. The 
general pattern is for an increase in ES for grades 1-8 followed by a 
decline for the later grades. This finding contradicts those reported 
by Crain and Mahard (1978) and St. John (1975). The Glass ES for grades 
K-6 was slightly, but not statistically, lower than the ES for grades 
7-12 (.43 and .55, respectively). Given the varying duration of these 
studies, Stephan (1982) calculated the ES per month for the NIE Core 
Studies. He found a pattern consistent with Crain and Mahard (1982) and 
St. John (1975). 

All of these estimates of ES are susceptible to bias due to selection or 
absence of initial subject equivalence. The results for those studies 
where it was possible to employ the pretest adjustment to 
remove initial differences between segregated and desegregated groups 
are presented in Table 5. These studies used the non-equivalent control 
group design and reported sufficient pretest information to calculate 
ESs. 

Table 5 

Adjusted and unadjusted methods for the [...] 

                      Overall        Selection      No Selection 
Method                mean ES        Problems       Problems 

Unadjusted            [illegible]    0.57 (n=20)    0.20 (n=10) 

Pretest Adjusted      [illegible]    [illegible]    0.20 (n=10) 

Pairwise t-value      [illegible]    p < .01        t(9) = 0, n.s. 


In two cases it was not possible to determine whether or not there were 
selection problems. 

The first column of the table indicates a sizeable and statistically 
significant difference between the "overall" unadjusted Glass 
effect-size estimate and the pretest-adjusted estimate (.41 and .16, 
respectively). The Glass estimate is similar to that reported above in 
Table 4. All studies were initially coded along a number of dimensions 
including most of Cook and Campbell's threats to validity before any 
effect sizes were actually calculated. The second and third columns 
compare studies with and without selection problems. The Glass ES 
estimate is higher for those studies with "selection problems" than the 
overall ES while the pretest-adjusted estimate remains the same as 
before (.57 and .16, respectively). Again, the two estimates are 
significantly different by statistical criteria. On the other hand, 
where selection was not considered a problem, the two estimates of ES 
are exactly the same (.20). This number is slightly higher for the 
pretest-adjusted estimates since two cases were omitted where it was not 
possible to determine a priori whether selection was a problem. 
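
The comparison of unadjusted and pretest-adjusted estimates can be made concrete with a short sketch. It assumes that the Glass ES divides the mean difference by the control-group standard deviation and that the adjustment subtracts the pretest ES from the posttest ES. The function names and all scores below are hypothetical and are not taken from any of the studies.

```python
def glass_es(mean_desegregated, mean_segregated, sd_segregated):
    """Glass effect size: mean difference in control-group SD units."""
    return (mean_desegregated - mean_segregated) / sd_segregated

def pretest_adjusted_es(pre, post):
    """Posttest ES minus pretest ES for a single case.

    `pre` and `post` are (mean_desegregated, mean_segregated, sd_segregated)
    tuples. This mirrors the adjustment as described in the text; it is an
    illustration, not the authors' code.
    """
    return glass_es(*post) - glass_es(*pre)

# Hypothetical case: the desegregated group already leads at pretest.
pre = (42.0, 40.0, 10.0)    # pretest ES = 0.2
post = (56.0, 52.0, 10.0)   # unadjusted posttest ES = 0.4
unadjusted = glass_es(*post)
adjusted = pretest_adjusted_es(pre, post)
```

In this made-up case half of the unadjusted effect (0.4) reflects a difference that existed before desegregation, leaving an adjusted effect of about 0.2.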

The difference between the pretest-adjusted ES and the ES for studies 
without selection problems may result from differential regression. 
Since the students involved in these studies generally score below the 
mean for their grade, their scores will regress to the higher mean at 
post-test solely due to the measurement error in the tests. Moreover, 
with an initial difference of .26 standard deviations, the control 
segregated students will regress more. This implies that the pretest 
correction overadjusts slightly. Assuming a test reliability 
of 0.8 to 0.9 for these students will account for the .04 difference. 
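
The arithmetic behind this claim can be checked with a rough sketch. It assumes a simple classical test theory model, which the text does not state explicitly: with no treatment effect, an observed pretest difference of d standard deviations is expected to shrink to about reliability × d at posttest, so roughly (1 − reliability) × d is lost to regression. The function name is hypothetical.

```python
def expected_regression(initial_diff, reliability):
    """Part of an initial group difference expected to vanish at posttest
    through regression to the mean (simple classical test theory model,
    assumed here for illustration only)."""
    return (1.0 - reliability) * initial_diff

initial_diff = 0.26  # pretest difference in SD units, as given in the text
for reliability in (0.80, 0.85, 0.90):
    shrink = expected_regression(initial_diff, reliability)
    print(f"reliability {reliability:.2f}: regression of about {shrink:.3f} SD")
```

Under this assumption, reliabilities between 0.8 and 0.9 give a regression artifact of roughly .03 to .05 standard deviations, which brackets the .04 difference discussed above.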

The pretest-adjustment method thus appears to remove the initial 
differences due to subject nonequivalence. It is the author's opinion 
that this provides a fairly accurate estimate of the overall actual 
benefit of desegregation on minority Black achievement. According to 
Glass et al. (1981, p. 103), each .1 ES is equal to .1 grade equivalents 
or one month of educational gain. Thus desegregated students may be 
gaining about two months due to attending an integrated environment. 
The analysis indicates only a slight, but statistically non-significant, 
gain for the few cases where results greater than one school year were 
reported. Similarly, there were only a very few cases where the 
percentage Black was reported. When the difference between percentage 
Black in the control (i.e., segregated) and treatment (i.e., 
desegregated) groups was calculated, it revealed that most of the 
effects were obtained in those studies where the difference ranged from 
76 to 85 percent. That is, students moving from almost completely 
segregated environments to predominantly White schools showed a sizeable 
(1.06 ES using the Glass method) effect. This finding is consistent 
with the Coleman Report. 
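
The conversion from effect size to months of gain used above is simple enough to write out. The sketch below applies the rule of thumb cited from Glass et al. (each .1 ES equals about one month); the constant and function names are introduced here for illustration only.

```python
ES_PER_MONTH = 0.1  # rule of thumb cited in the text from Glass et al. (1981)

def es_to_months(effect_size):
    """Convert an effect size in SD units to approximate months of gain."""
    return effect_size / ES_PER_MONTH

months_adjusted = es_to_months(0.16)       # pretest-adjusted estimate
months_no_selection = es_to_months(0.20)   # studies without selection problems
```

Both estimates come to about one and a half to two months, which is the basis for the "about two months" statement.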

Finally, the Glass effect size estimates for reading and mathematics 
were examined separately. These results are presented in Table 6. As 
with the overall ES, both effects are positive, indicating a benefit for 
desegregated students. Contrary to previous research (Krol, 1978; St. 
John, 1975) the ES for reading achievement was considerably larger than 
that for math (.57 and .33, respectively). This difference was not 
statistically significant, however. Thus a single overall estimate of 
achievement effects appears to be an appropriate measure of the impact 
of desegregation. 

The NIE Core Studies 



A similar analysis was performed on the 19 studies selected by the NIE 
panel of experts. The results are presented in Table 7. The 
information is presented by study with overall effects presented at the 
end. The pattern of results is quite similar to those presented above. 
All ESs are again positive, indicating a beneficial impact of 
desegregation on achievement. The ESs are slightly lower partly due to 
the inclusion of the negative ESs for the Sheehan (1979) and Walberg 
(1971) studies. 

The overall mean unadjusted Glass ES is .25. The unadjusted ES estimate 
is comparable to the .23 reported by Crain and Mahard (1982) and, more 
recently, the .2^ by Crain (1983) for the best designed studies. It is 
only slightly less than the .28 ES that Crain and Mahard (1982) claim 
for "the estimated treatment assuming the best possible research 
design." However, all of those estimates ignore the bias introduced by 
the initial nonequivalence of the students. When adjusted for pretest 
differences, the ES is reduced to .14. Compared to the original 31 
studies, the decrease for the Glass ES is .17, but it is only .02 for 
the pretest-adjusted ES. The reason for this is that negative ESs have 
been added by the panel to the core studies which largely, but not 
entirely, reflect pre-existing differences among segregated and 
desegregated students. In these cases, however, the differences favored 
the segregated students. In fact, there is a large correlation between 
pretest and posttest effect sizes (r = .76), indicating that 
pre-existing differences largely remain at the posttest. Thus subject 
nonequivalence is a persistent source of bias in these studies. It is for 
this reason that the pretest adjustment method was employed. This 
adjusted ES provides a less biased estimate of the overall effectiveness 
of desegregation. The adjustment is equally successful for studies with 
large ESs (greater than 1.0) such as Rentsch (1967). 
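
The correlation between pretest and posttest effect sizes can be computed as an ordinary Pearson coefficient across cases. The following is a minimal sketch with hypothetical effect sizes (the actual case-level values are in Table 7 and are not reproduced here); `pearson_r` is written out by hand so that the example has no dependencies.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pretest and posttest effect sizes for five cases.
pre_es = [-0.30, -0.10, 0.00, 0.20, 0.40]
post_es = [-0.10, 0.15, 0.05, 0.45, 0.50]

r = pearson_r(pre_es, post_es)
# Pretest-adjusted effect size for each case.
adjusted = [post - pre for pre, post in zip(pre_es, post_es)]
```

A high positive r, as in these made-up data, means that cases starting with an advantage tend to end with one. In that situation the unadjusted posttest ES carries the initial difference forward, and the subtraction in the last line removes it.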

As with the larger set of 31 studies, the core studies show the effects 
for reading achievement to be modestly larger than those for mathematics 
(.28 and .23, respectively). However, when these figures are decomposed 
by duration or length of desegregation, there is an interaction with 
mathematics showing larger effects for those studies longer than one 
year. While there are relatively few cases available, this may explain 
the difference between the overall results in this study and those 
reported by others. It may be that studies of longer duration 
comprised the majority of those reviewed by Krol (1978) and St. John 
(1975). 

SUMMARY 



The synthesis of scientific research using formal statistical procedures 
such as Glass' meta-analysis presents special problems when studies are 
methodologically flawed. The research literature on the effectiveness 
of school desegregation on minority Black achievement is almost totally 
comprised of quasi-experiments or weaker research designs. While Glass 
has recommended including all studies in a research synthesis, his work 
has largely dealt with studies that are "well designed." In those 
instances where "poorly designed" studies have been included, design 
effects have been found (Glass and Smith, 1979; Gilbert et al., 1977; 
Wortman, 1981) indicating major differences in estimates of effects 
between studies with strong and weak designs. The typical approach to 
this problem is to examine the higher-quality studies taking into 
account, where possible, the flaws or threats to validity. This was the 
approach taken in this study. Specific methodological criteria for 
including studies in the research synthesis were developed and applied 
to the school desegregation literature. All studies were found to have 
some serious flaws, but 31 were considered acceptable for analysis. 
Even within this set, there was variation in design quality and a 
considerable design effect. The NIE panel of experts decided to include 
only the highest quality studies and this further reduced the set to 18 
studies. The study by Walberg (1971) was felt to be of sufficient 
quality to be added to this set although it had originally been 
"rejected" for a variety of methodological flaws. 

The NIE Core Studies had an overall effect size of .25 standard 
deviations. This is almost identical to the effect size estimate 
reported by Crain and his associates for well-designed studies. Since 
most of these studies suffered from initial subject nonequivalence, an 
adjusted effect size was calculated by subtracting out the effect size 
at the pretest prior to desegregation. This resulted in an effect size 
of .14. Given differential statistical regression to the mean, this is 
probably a slight underestimate. This is similar to that found for the 
larger set of 31 studies and also to Krol's (1978) finding. In 
examining the results of the two analyses reported above, the best 
overall estimate of the effect of school desegregation on Black 
achievement appears to be about .2 of a standard deviation. This 
estimate is based on those cases not having selection problems and is 
comparable to the adjusted estimates. 

Other subsidiary analyses comparing type of achievement, duration of 
desegregation, grade level, and difference in percent Black for 
segregated and desegregated students were also examined. Reading was 
found to be slightly higher than math achievement although this may vary 
with length of desegregation. The larger set of studies revealed a 
curvilinear pattern of effects with an increase from grades K-7 and a 
decrease from 8-12. This result does not agree with other findings 
indicating larger benefits the earlier desegregation occurs. No effect 
was found for amount of desegregation (i.e., less than one year compared 
to more than one year). Some support was found for the finding of the 
Coleman Report that effects are greatest in the most integrated 
environments. 

What do these findings mean? The effect size found in both 
analyses reported here indicates about a two-month gain or benefit for 
desegregated students. The meaning attached to this finding represents 
a judgment. This is where social science ends and social policy begins. 
However, we have examined the scientific literature on coronary-artery 
bypass graft surgery for comparative purposes. This is a widely 
accepted medical procedure that is currently performed on well over 
100,000 persons annually at a cost of nearly $2 billion. Much of this 
expense is reimbursed by third-party payers including the federal 
government. A research synthesis of the higher-quality studies (i.e., 
randomized) found a benefit of .8 standard deviations representing only 
a 4.4 percent increase in survival rates (Wortman and Yeaton, in press). 
This is a modest increase at a considerable social cost when compared to 
school desegregation. Moreover, programs aimed at the young such as 
school desegregation typically are more cost effective than those for 
the elderly such as bypass surgery. 

Although the methods developed above have been useful in dealing with 
problems of student equivalence, they cannot adjust for the second major 
problem noted by St. John (1975) of "equivalence of schools." The 
actual details of the educational programs involved in the desegregation 
studies are not reported. Thus it is not possible to distinguish 
effective from ineffective programs. The real problem as Gerard and 
Miller (1975) conclude is "to foster integration of the minority 
children into the classroom social structure and academic program." 
Recent studies have addressed this issue and developed procedures for 
improving educational practice in desegregated classrooms (Aronson and 
Bridgeman, 1979; Slavin and Madden, 1979). A number of the papers by 
members of the NIE expert panel focused on these procedures. Such 
research based on sound social science theory is likely to lead to 
increased educational benefits for desegregated students. 

The political reality confronting the achievement of school 
desegregation today is the need to allow students in highly segregated 
urban inner cities access to schools in the surrounding white collar 
suburbs. Such "metropolitan plans" have been found to achieve 
desegregation without white flight. They are also quite controversial 
and typically require cross-district busing. The results in St. Louis 
are encouraging. Here voluntary cross-district busing combined with 
inner city magnet schools has produced two-way desegregation with some 
Whites returning to the city schools. It should be noted that the plan 
is an alternative to court-ordered mandatory metropolitan desegregation. 
Moreover, it should be added that such plans resemble the early 
voluntary plans in the Northeast. As a social policy, these plans, 
capitalizing on good suburban schools, a cooperative environment, and 
motivated volunteers, produced the largest effects of the studies 
examined. 

FOOTNOTES 



¹Cohen's estimate of effect size is nearly identical. The 
denominator includes information from both treatment and control groups, 
the pooled-within standard deviation. Hedges (1982) maintains that this 
produces a less biased estimate of effect. However, this estimator 
ignores problems caused by the effect of the treatment on the 
experimental (i.e., desegregated) group standard deviation. 
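
The difference between the two estimators can be illustrated with a brief sketch. It assumes the standard definitions: the Glass estimate divides the mean difference by the control-group standard deviation, while Cohen's estimate divides by the pooled within-group standard deviation. The means, standard deviations, and sample sizes below are hypothetical.

```python
import math

def glass_delta(mean_t, mean_c, sd_c):
    """Glass estimate: mean difference over the control-group SD."""
    return (mean_t - mean_c) / sd_c

def cohen_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Cohen estimate: mean difference over the pooled within-group SD."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

# Hypothetical case in which the treatment also widens the spread of scores
# in the desegregated group (SD of 12 versus 10 for the segregated group).
delta = glass_delta(55.0, 50.0, 10.0)
d = cohen_d(55.0, 50.0, 12.0, 10.0, 50, 50)
```

Here the Glass estimate is 0.5 while the pooled estimate is somewhat smaller, because the larger spread in the desegregated group enters the pooled denominator. When the two group standard deviations are equal the estimates coincide.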

²Unfortunately, it was not possible to calculate effect sizes from this 
study either since standard deviations were not reported. Similar 
problems plague the earlier reports as well. 

³In fact, one of the "neutral" members had testified numerous times 
against desegregation in court cases. 